[Snowball-discuss] Removing été from French stop words
Martin Holmes
mholmes at uvic.ca
Thu Apr 16 15:17:37 BST 2020
Hi all,
Any generic stopwords list will have problems like this, of course, but
there are other uses for stopwords. For example, for searching on a site
about John Keats, it makes sense to add "Keats" to the stopword list
used to constrain searches, since virtually every paragraph contains his
name, and running a search on it is resource-intensive and pointless.
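
A minimal sketch of that idea in Python, assuming a simple whitespace-tokenised query. The names GENERIC_STOPWORDS, SITE_STOPWORDS and filter_query are hypothetical and for illustration only; they are not part of Snowball, and the generic list here is a tiny stand-in for a real one.

```python
# Hypothetical names throughout; this is not Snowball code.
GENERIC_STOPWORDS = {"the", "a", "of", "and", "to", "in"}  # tiny illustrative subset
SITE_STOPWORDS = {"keats"}  # site-specific addition: appears in nearly every paragraph


def filter_query(query, stopwords):
    """Return the lower-cased query terms that are not in the stopword set."""
    return [term for term in query.lower().split() if term not in stopwords]


# The list used to constrain searches is the generic list plus the site terms.
search_stopwords = GENERIC_STOPWORDS | SITE_STOPWORDS
terms = filter_query("the odes of Keats", search_stopwords)
```

Keeping the site-specific terms in their own set leaves the generic list untouched, so it can still be reused for other purposes.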
Cheers,
Martin
On 2020-04-16 4:08 a.m., Martin Porter wrote:
> Philippe,
>
> Well noted.
>
> One might reasonably ask why the snowball language examples chose to
> include stopword lists anyway, since they are not really a part of
> stemming. But in the process of setting up the test vocabularies,
> ranked lists of words by frequency readily came out, from which a set
> of "particle" or "stop" words could easily be constructed. It seemed
> useful to present them.
>
> Originally, stopwords were words that were considered too common and
> too lacking in meaning to be worth indexing for retrieval purposes.
> But that is historical now, and for phrase searching they will be
> required. Indeed some important phrases are made up entirely of these
> words ("to be or not to be" etc). But stopwords have other uses: they
> can be used to make a simple tool for language identification, and they
> can be eliminated from terms to be used in automatic query expansion.
> Stopwords that are homonyms may still be useful in these extra
> contexts.
>
> My own feeling is that rather than removing these homonym-stopwords
> from the lists, they should be marked as being homonyms, and therefore
> as problematical. Whether they are eliminated from use will then
> depend on what the stopword list is being used for.
>
> Martin
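
The suggestion quoted above, marking homonym stopwords rather than removing them, could be sketched roughly as follows. This is only an illustration under stated assumptions: StopEntry, FRENCH_SAMPLE, stopwords_for and the purpose names are all hypothetical, and the Snowball lists do not currently carry such a flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopEntry:
    word: str
    homonym: bool = False  # True if the same form is also a content word


# Tiny illustrative sample, not the real French list.
FRENCH_SAMPLE = [
    StopEntry("le"),
    StopEntry("et"),
    StopEntry("été", homonym=True),  # past participle of "être", but also "summer"
]


def stopwords_for(entries, purpose):
    """Pick the stopword set for a purpose (purpose names are assumptions)."""
    # For language identification a homonym is still a useful signal;
    # for anything else we drop flagged entries so the content word survives.
    keep_homonyms = purpose == "language_id"
    return {e.word for e in entries if keep_homonyms or not e.homonym}
```

With this layout the list itself stays complete, and each consumer decides whether flagged entries such as "été" are treated as stopwords.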