[Snowball-discuss] Removing été from French stop words

Martin Porter martin.f.porter at gmail.com
Thu Apr 16 12:08:59 BST 2020


Philippe,

Well noted.

One might reasonably ask why the snowball language examples chose to
include stopword lists anyway, since they are not really a part of
stemming. But in the process of setting up the test vocabularies,
ranked lists of words by frequency readily came out, from which a set
of "particle" or "stop" words could easily be constructed. It seemed
useful to present them.

Originally, stopwords were words that were considered too common and
too lacking in meaning to be worth indexing for retrieval purposes.
But that is historical now, and for phrase searching they will be
required. Indeed some important phrases are made up entirely of these
words ("to be or not to be" etc). But stopwords have other uses: they
can be used to make a simple tool for language identification, and they
can be eliminated from terms to be used in automatic query expansion.
Stopwords that are homonyms may still be useful in these extra
contexts.
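
As a rough illustration of the language identification use, here is a
minimal sketch in Python. The word sets are tiny made-up samples, not
the actual snowball lists, and the name guess_language is hypothetical.
It simply counts how many tokens of a text fall in each language's
stopword set and picks the highest count.

```python
# Tiny illustrative stopword samples; NOT the real snowball lists.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "not", "be", "or"},
    "french": {"le", "la", "et", "de", "est", "un", "une", "les", "des"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    tokens = text.lower().split()
    # Score each language by the number of tokens found in its stopword set.
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No stopword hit at all means we have no evidence either way.
    return best if scores[best] > 0 else None
```

A real tool would need proper tokenisation and much fuller lists, but
the idea is the same: the commonest words of a language are a cheap and
fairly reliable fingerprint.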

My own feeling is that rather than removing these homonym-stopwords
from the lists, they should be marked as being homonyms, and therefore
as problematical. Whether they are eliminated from use will then
depend on what the stopword list is being used for.
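
One possible shape for this, sketched in Python under assumptions of my
own: each entry carries a homonym flag, and a hypothetical helper
stopwords_for decides from the intended purpose whether flagged words
stay in the list. The entries, the purpose names and the policy itself
are illustrative only, not part of any snowball distribution.

```python
# Hypothetical marked list: (word, is_homonym).
# "été" is both the past participle of "être" and the noun "summer".
FRENCH_STOPWORDS = [
    ("le", False),
    ("de", False),
    ("et", False),
    ("été", True),
]

# Assumed policy: these uses can tolerate homonyms in the stopword list.
PURPOSES_KEEPING_HOMONYMS = {"language_id", "query_expansion"}


def stopwords_for(purpose, entries=FRENCH_STOPWORDS):
    """Return the set of stopwords appropriate for the given purpose."""
    keep_homonyms = purpose in PURPOSES_KEEPING_HOMONYMS
    # Unflagged words are always kept; flagged ones only if the purpose allows.
    return {
        word
        for word, is_homonym in entries
        if keep_homonyms or not is_homonym
    }
```

With this arrangement stopwords_for("indexing") leaves out "été", so
that the noun remains searchable, while stopwords_for("language_id")
keeps it.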

Martin


