[Snowball-discuss] Removing été from French stop words

Olly Betts olly at survex.com
Thu Apr 16 04:25:19 BST 2020


On Wed, Apr 15, 2020 at 08:14:55PM -0400, Philippe Ouellet wrote:
> Could you give me the link to the file in the repo? I have no idea where that is.

algorithms/french/stop.txt

"git grep" is good for such situations:

$ git grep 'été'
algorithms/french/stop.txt:été
algorithms/french/stop.txt:étée
algorithms/french/stop.txt:étées
algorithms/french/stop.txt:étés

> I only noticed été because one of our product names contains that
> word, but you are right about aura and avions.
>
> Do we need to remove avions from the stop words if it gets changed to
> its singular form during analysis?

Yes - the stop word lists as shipped are intended for stopping before
stemming.

You could run each entry through the stemmer to get a post-stemming
stopword list, but that is more likely to stop words that are wanted:
each entry then removes every word which stems to the same thing as a
stopword.  For example, "avions" stems to "avion", so both "avion" and
"avions" would be treated as stopwords if you stop after stemming.
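
To illustrate the difference, here is a minimal Python sketch.  The
toy_stem function is a hypothetical stand-in that only strips a trailing
"s"; it is not the Snowball French stemmer.  The stop list is a tiny
excerpt chosen for the example, and "moteurs" is just an arbitrary extra
word.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


# Tiny excerpt for illustration; the real list is algorithms/french/stop.txt.
STOP_WORDS = {"avions", "été"}


def stop_then_stem(tokens):
    """Stop before stemming, as the shipped lists intend."""
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def stem_then_stop(tokens):
    """Stop after stemming, using a stop list that was itself stemmed."""
    stemmed_stops = {toy_stem(w) for w in STOP_WORDS}
    return [s for s in map(toy_stem, tokens) if s not in stemmed_stops]


tokens = ["avion", "avions", "moteurs"]
print(stop_then_stem(tokens))  # ['avion', 'moteur']
print(stem_then_stop(tokens))  # ['moteur']
```

With stopping before stemming, only "avions" is dropped and "avion"
survives; with stopping after stemming, both are lost.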

> I am having second thoughts: removing été could have a great impact on
> the search results. Someone searching for “summer” would also find
> every result containing the past tense form of “to be”: the impact
> is huge.

That's still better than not being able to search for "summer" at all.
There are similar issues in English, and we explicitly don't include
words like "can", "may", "will", "must", etc. in
algorithms/english/stop.txt (as noted in the comments).

For such words, a search for just the word itself will tend to give
somewhat poor results: a French document that's actually talking about
"summer" will probably mention "été" more often on average and so tend
to rank higher, but some documents that aren't about "summer" at all are
likely to rank above some that are.

Searches for more than one word are likely to fare better.  Especially
so if the ranking favours cases where the search terms appear close
together.

> Is there a way to make “a été” the stop word instead?

Not within the scope of a simple list of stop words.  I guess you could
use part-of-speech tagging to differentiate the cases where the stop
word and the homonym are different parts of speech (which they
typically seem to be).
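
As a much cruder alternative to part-of-speech tagging, one could
filter on the preceding token instead.  The sketch below is only an
illustration of that idea, not something Snowball provides: the
AUXILIARIES set and the drop_participle_ete function are hypothetical
names, and the set is a small hand-picked sample of "avoir" forms.

```python
# Assumption: a small hand-picked sample of "avoir" forms that can directly
# precede the participle "été"; a real system would need a fuller list or
# a proper part-of-speech tagger.
AUXILIARIES = {"a", "ai", "as", "avons", "avez", "ont", "avait"}


def drop_participle_ete(tokens):
    """Drop 'été' only when the previous token is one of AUXILIARIES."""
    kept = []
    prev = None
    for tok in tokens:
        word = tok.lower()
        if word == "été" and prev in AUXILIARIES:
            prev = word
            continue  # treated as the participle: stop it
        kept.append(tok)
        prev = word
    return kept


print(drop_participle_ete(["il", "a", "été", "malade"]))  # ['il', 'a', 'malade']
print(drop_participle_ete(["un", "été", "chaud"]))        # ['un', 'été', 'chaud']
```

This misses cases where another word sits between the auxiliary and the
participle (e.g. "a toujours été"), which is the sort of case a tagger
would be expected to handle better.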

Cheers,
    Olly


