[Snowball-discuss] French stemmer test data wrong?

Martin Holmes mholmes at uvic.ca
Thu Jan 21 19:04:04 GMT 2021


Hi all,

I'm working on a French stemmer in XSLT, and stumbling over the word 
abaissement.

According to the test data, this word should not change during stemming:

<https://raw.githubusercontent.com/snowballstem/snowball-data/master/french/voc.txt>
<https://raw.githubusercontent.com/snowballstem/snowball-data/master/french/output.txt>

(It's line 8 in the input and output.)

It seems odd that it shouldn't change, but as we know, stemming 
algorithms aren't perfect. My stemmer, however, stems it to abaiss. 
Here is how I follow the process:

1. Defining RV: "If the word begins with two vowels, RV is the region 
after the third letter, otherwise the region after the first vowel not 
at the beginning of the word, or the end of the word if these positions 
cannot be found."

So for abaissement, this should be the region after the first vowel not 
at the beginning of the word:

aba - issement
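
As a minimal sketch of the RV rule quoted above (Python, purely 
illustrative): the function name rv_start and the vowel set are 
assumptions made for this example. It implements only the sentence 
quoted in step 1 and ignores any other preprocessing or special cases 
the full algorithm may apply.

```python
# Assumed French vowel set for this sketch.
VOWELS = set("aeiouyâàëéêèïîôûù")

def rv_start(word):
    """Return the index where RV begins, per the rule quoted above."""
    # Word begins with two vowels: RV is the region after the third letter.
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    # Otherwise: RV is the region after the first vowel not at the beginning.
    for i in range(1, len(word)):
        if word[i] in VOWELS:
            return i + 1
    # No such position: RV is empty (the end of the word).
    return len(word)

word = "abaissement"
start = rv_start(word)
print(word[:start], "-", word[start:])  # aba - issement
```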

2. In step 1, we have:

    ement   ements
     delete if in RV

Clearly "ement" is in RV, so it should be deleted.

The only possibility I can see is that further down in step 1, we have this:

    issement   issements
     delete if in R1 and preceded by a non-vowel

We do have "issement", and it is in R1 (which is -aissement), but it is 
preceded by a vowel, so it shouldn't be deleted.
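
The R1 check can be sketched the same way (Python, illustrative only). 
This assumes the usual Snowball definition of R1 as the region after the 
first non-vowel following a vowel; r1_start and the vowel set are names 
and values chosen for this example.

```python
# Assumed French vowel set for this sketch.
VOWELS = set("aeiouyâàëéêèïîôûù")

def r1_start(word):
    """Return the index where R1 begins (assumed definition, see above)."""
    for i in range(1, len(word)):
        # First non-vowel that directly follows a vowel.
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

word = "abaissement"
suffix = "issement"
cut = len(word) - len(suffix)  # index where the suffix starts

in_r1 = word.endswith(suffix) and cut >= r1_start(word)
preceded_by_non_vowel = cut > 0 and word[cut - 1] not in VOWELS

print(word[r1_start(word):])         # aissement
print(in_r1, preceded_by_non_vowel)  # True False
```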

So am I right in concluding that the "issement" rule should be tried 
first (because "issement" is longer than "ement"), and that when its 
condition fails, it should preclude the shorter "ement" match as well?
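
A minimal sketch of that reading, in Python: the longest listed suffix 
is selected, and only its condition decides whether anything is deleted. 
This is an assumption about the intended behaviour, not a confirmed 
description of the reference implementation. The names step1_sketch and 
RULES are hypothetical, only the two rules quoted in this message are 
included, and any follow-up actions the full step 1 performs are left out.

```python
# Assumed French vowel set for this sketch.
VOWELS = set("aeiouyâàëéêèïîôûù")

def rv_start(word):
    # RV per the rule quoted in step 1 of this message.
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    for i in range(1, len(word)):
        if word[i] in VOWELS:
            return i + 1
    return len(word)

def r1_start(word):
    # R1 assumed to start after the first non-vowel following a vowel.
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def in_rv(word, cut):
    return cut >= rv_start(word)

def in_r1_after_non_vowel(word, cut):
    return cut >= r1_start(word) and cut > 0 and word[cut - 1] not in VOWELS

# Suffix -> condition under which it is deleted (only the quoted rules).
RULES = {
    "ement": in_rv,
    "ements": in_rv,
    "issement": in_r1_after_non_vowel,
    "issements": in_r1_after_non_vowel,
}

def step1_sketch(word):
    matches = [s for s in RULES if word.endswith(s)]
    if not matches:
        return word
    # Select the longest matching suffix; shorter matches are not retried.
    suffix = max(matches, key=len)
    cut = len(word) - len(suffix)
    return word[:cut] if RULES[suffix](word, cut) else word

print(step1_sketch("abaissement"))  # abaissement
```

Under this reading, "abaissement" is left alone because "issement" wins 
the match and its condition fails, which agrees with the test data.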

All help appreciated,
Martin



