<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">How does asciifolding fit into this? <div class=""><br class=""></div><div class="">I would like “mais” to be a stop word, but not “maïs” (which means corn). “mais” has no meaning other than “but”, so it should be a stop word.</div><div class=""><br class=""></div><div class="">The current list includes “mais”; should we comment it out?</div><div class=""><br class=""><div class="">
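<div class="">For illustration, here is a minimal Python sketch of one possible ordering: check stop words on the original token first, and only fold accents afterwards. Everything in it is an assumption made for the example: the names STOP_WORDS, fold_ascii and analyze are hypothetical, the one-entry list stands in for the real stop.txt, and the unicodedata-based folding is only a rough stand-in for an asciifolding filter, not its actual implementation.</div>
<pre class="">
```python
import unicodedata

# Hypothetical one-entry stop list; the real list lives in algorithms/french/stop.txt.
STOP_WORDS = {"mais"}


def fold_ascii(token):
    """Strip combining accents (a rough stand-in for an asciifolding filter)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(tokens):
    """Drop stop words on the unfolded tokens, then fold what is left."""
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return [fold_ascii(t) for t in kept]
```
</pre>
<div class="">With this ordering, analyze(["mais", "maïs"]) drops “mais” and keeps “maïs” (folded to "mais" afterwards). If folding ran before the stop word check, both tokens would be dropped.</div>
<div class=""><br class=""></div>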
<div dir="auto" style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none; word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div>--<br class="">Philippe Ouellet<br class="">Web Developer</div><div><a href="https://camellia-sinensis.com" class="">https://camellia-sinensis.com</a></div></div>
</div>
<div><br class=""><blockquote type="cite" class=""><div class="">On Apr 15, 2020, at 23:25, Olly Betts &lt;<a href="mailto:olly@survex.com" class="">olly@survex.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">On Wed, Apr 15, 2020 at 08:14:55PM -0400, Philippe Ouellet wrote:<br class=""><blockquote type="cite" class="">Could you give me the link to the file in the repo? I have no idea where that is.<br class=""></blockquote><br class="">algorithms/french/stop.txt<br class=""><br class="">"git grep" is good for such situations:<br class=""><br class="">$ git grep 'été'<br class="">algorithms/french/stop.txt:été<br class="">algorithms/french/stop.txt:étée<br class="">algorithms/french/stop.txt:étées<br class="">algorithms/french/stop.txt:étés<br class=""><br class=""><blockquote type="cite" class="">I only noticed été because one of our product names contains that<br class="">word, but you are right about aura and avions. <br class=""><br class="">Do we need to remove avions from the stop words if it gets changed to<br class="">its singular form during analysis?<br class=""></blockquote><br class="">Yes - the stop word lists as shipped are intended for stopping before<br class="">stemming.<br class=""><br class="">You could run each entry through the stemmer to get a post-stemming<br class="">stopword list, though it will tend to have more issues with stopping<br class="">words that are wanted because each entry will now remove all words which<br class="">stem to the same thing as a stopword - e.g. 
"avions" stems to "avion"<br class="">and so both "avion" and "avions" would get treated as stopwords if you<br class="">stop after stemming.<br class=""><br class=""><blockquote type="cite" class="">I am having second thoughts: removing été from the stop words could have a great impact on<br class="">the search results. Someone searching for “summer” would end up<br class="">finding all results containing the past tense form of “to be”: the<br class="">impact is huge.<br class=""></blockquote><br class="">That's still better than not being able to search for "summer" at all.<br class="">There are similar issues in English and we explicitly don't include<br class="">words like "can", "may", "will", "must", etc. in<br class="">algorithms/english/stop.txt (as noted in the comments).<br class=""><br class="">For such words a search for just the word itself will tend to have<br class="">somewhat poor results (a French document that's actually talking about<br class="">"summer" will probably mention "été" more on average and so tend to rank<br class="">higher, but there are likely to be some documents that aren't about<br class="">"summer" at all ranked above some that are.)<br class=""><br class="">Searches for more than one word are likely to fare better. Especially<br class="">so if the ranking favours cases where the search terms appear close<br class="">together.<br class=""><br class=""><blockquote type="cite" class="">Is there a way to make “a été” the stop word instead?<br class=""></blockquote><br class="">Not in the scope of a simple list of stop words. I guess you could<br class="">try part of speech tagging to try to differentiate the cases where<br class="">the stop word and the homonym are different parts of speech (which<br class="">they typically seem to be).<br class=""><br class="">Cheers,<br class=""> Olly<br class=""></div></div></blockquote></div><br class=""></div></body></html>