[Snowball-discuss] Removing été from French stop words

Philippe Ouellet philippe at camellia-sinensis.com
Thu Apr 16 13:11:08 BST 2020


How does asciifolding fit into this? 

I would like “mais” to be a stop word, since it has no meaning other than “but”. However, “maïs” (which means corn) should not be one, and asciifolding would turn “maïs” into “mais”.

The current list includes “mais”. Should we comment it out?
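
To make the ordering question concrete, here is a minimal sketch. It is a toy example, not how any particular search engine implements asciifolding: `ascii_fold`, `fold_then_stop`, `stop_then_fold` and the one-word `STOP_WORDS` set are all names made up for illustration. It only shows that the result depends on whether folding happens before or after stop word removal.

```python
import unicodedata

# Illustrative one-word list; the real list lives in algorithms/french/stop.txt.
STOP_WORDS = {"mais"}


def ascii_fold(token):
    """Strip combining accents, e.g. "maïs" becomes "mais"."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_then_stop(tokens):
    """Fold first: "maïs" collides with the stop word "mais" and is lost."""
    folded = [ascii_fold(t) for t in tokens]
    return [t for t in folded if t not in STOP_WORDS]


def stop_then_fold(tokens):
    """Stop first: only the real "mais" is removed, "maïs" survives."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [ascii_fold(t) for t in kept]


tokens = ["mais", "le", "maïs", "pousse"]
print(fold_then_stop(tokens))  # ['le', 'pousse']
print(stop_then_fold(tokens))  # ['le', 'mais', 'pousse']
```

If that is right, keeping “mais” in the list would be fine as long as stopping runs before folding, but I may be missing something about how the filters are normally chained.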

--
Philippe Ouellet
Web Developer
https://camellia-sinensis.com

> On Apr 15, 2020, at 23:25, Olly Betts <olly at survex.com> wrote:
> 
> On Wed, Apr 15, 2020 at 08:14:55PM -0400, Philippe Ouellet wrote:
>> Could you give me the link to the file in the repo? I have no idea where that is.
> 
> algorithms/french/stop.txt
> 
> "git grep" is good for such situations:
> 
> $ git grep 'été'
> algorithms/french/stop.txt:été
> algorithms/french/stop.txt:étée
> algorithms/french/stop.txt:étées
> algorithms/french/stop.txt:étés
> 
>> I only noticed été because one of our product names contains that
>> word, but you are right about aura and avions.
>> 
>> Do we need to remove avions from the stop words if it gets changed to
>> its singular form during analysis?
> 
> Yes - the stop word lists as shipped are intended for stopping before
> stemming.
> 
> You could run each entry through the stemmer to get a post-stemming
> stopword list, though it will tend to have more issues with stopping
> words that are wanted because each entry will now remove all words which
> stem to the same thing as a stopword - e.g. "avions" stems to "avion"
> and so both "avion" and "avions" would get treated as stopwords if you
> stop after stemming.
> 
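The difference between stopping before and after stemming can be sketched as follows. This is only an illustration: `toy_stem` is a hypothetical stand-in that drops a trailing "s" (the real Snowball French stemmer does far more), and `STOP_WORDS` is a two-word subset chosen for the example.

```python
# Illustrative subset of the French stop word list.
STOP_WORDS = {"avions", "été"}


def toy_stem(word):
    """Hypothetical stand-in for the real stemmer: drop one trailing "s"."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def stop_before_stemming(tokens):
    """Intended use of the shipped list: filter raw tokens, then stem."""
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


# Post-stemming list: every stop word run through the stemmer.
STEMMED_STOP_WORDS = {toy_stem(w) for w in STOP_WORDS}


def stop_after_stemming(tokens):
    """Stem first, then filter against the stemmed stop word list."""
    return [s for s in map(toy_stem, tokens) if s not in STEMMED_STOP_WORDS]


tokens = ["avion", "avions", "rapide"]
print(stop_before_stemming(tokens))  # ['avion', 'rapide']
print(stop_after_stemming(tokens))   # ['rapide']
```

With stopping after stemming, the wanted word "avion" disappears along with "avions", which is the extra loss described above.
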
>> I am having second thoughts: removing été could have a big impact on
>> the search results. Someone searching for “summer” would then also get
>> every document containing the past participle of “to be”.
> 
> That's still better than not being able to search for "summer" at all.
> There are similar issues in English and we explicitly don't include
> words like "can", "may", "will", "must", etc in
> algorithms/english/stop.txt (as noted in the comments).
> 
> For such words a search for just the word itself will tend to have
> somewhat poor results (a French document that's actually talking about
> "summer" will probably mention "été" more on average and so tend to rank
> higher, but there are likely to be some documents that aren't about
> "summer" at all ranked above some that are.)
> 
> Searches for more than one word are likely to fare better.  Especially
> so if the ranking favours cases where the search terms appear close
> together.
> 
>> Is there a way to make “a été” the stop word instead?
> 
> Not in the scope of a simple list of stop words.  I guess you could
> try part of speech tagging to try to differentiate the cases where
> the stop word and the homonym are different parts of speech (which
> they typically seem to be).
> 
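As a rough idea of what a context-sensitive rule might look like, here is a small sketch. It is not part of speech tagging and not anything Snowball provides: it is a crude adjacency heuristic that drops "été" only when the previous token is a form of "avoir". The `AVOIR_FORMS` set and the function name are assumptions made up for this example, and the set is deliberately incomplete.

```python
# Assumed, incomplete list of "avoir" forms that can precede the participle.
AVOIR_FORMS = {"a", "ai", "as", "ont", "avait", "avaient", "aura", "avons", "avez"}


def drop_participle_ete(tokens):
    """Drop "été" only when the previous token is a form of "avoir"."""
    kept = []
    previous = None
    for token in tokens:
        is_participle = token == "été" and previous in AVOIR_FORMS
        if not is_participle:
            kept.append(token)
        previous = token
    return kept


print(drop_participle_ete(["il", "a", "été", "malade"]))
# ['il', 'a', 'malade']
print(drop_participle_ete(["un", "thé", "pour", "l'", "été"]))
# ['un', 'thé', 'pour', "l'", 'été']
```

A rule like this would miss cases such as "a toujours été", where an adverb sits between the auxiliary and the participle, so it is a starting point at best.
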
> Cheers,
>    Olly
