[Snowball-discuss] French stemmer test data wrong?
Martin Holmes
mholmes at uvic.ca
Thu Jan 21 19:04:04 GMT 2021
Hi all,
I'm working on a French stemmer in XSLT, and stumbling over the word
abaissement.
According to the test data, this word should not change during stemming:
<https://raw.githubusercontent.com/snowballstem/snowball-data/master/french/voc.txt>
<https://raw.githubusercontent.com/snowballstem/snowball-data/master/french/output.txt>
(It's line 8 in the input and output.)
It seems odd that it shouldn't change, but as we know, stemming
algorithms aren't perfect. My stemmer, however, stems it to abaiss.
Here is how I read the algorithm:
1. Defining RV: "If the word begins with two vowels, RV is the region
after the third letter, otherwise the region after the first vowel not
at the beginning of the word, or the end of the word if these positions
cannot be found."
So for abaissement, this should be the region after the first vowel not
at the beginning of the word:
aba - issement
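As a sanity check, here is a minimal Python sketch of the RV rule exactly as quoted above. The names VOWELS and rv_start are mine, not from the Snowball sources, and the vowel list is the one I understand the French algorithm to use; it covers only the sentence quoted in point 1.

```python
# Vowels assumed for the French stemmer (an assumption for this sketch).
VOWELS = set("aeiouyâàëéêèïîôûù")

def rv_start(word):
    """Return the index at which RV begins, following the quoted rule."""
    # Word begins with two vowels: RV is the region after the third letter.
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    # Otherwise: the region after the first vowel not at the beginning.
    for i in range(1, len(word)):
        if word[i] in VOWELS:
            return i + 1
    # No such position: RV is empty (the end of the word).
    return len(word)

word = "abaissement"
start = rv_start(word)
print(word[:start], "-", word[start:])  # aba - issement
```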
2. In step 1, we have:
ement ements
delete if in RV
Clearly "ement" is in RV, so it should be deleted.
The only possibility I can see is that further down in step 1, we have this:
issement issements
delete if in R1 and preceded by a non-vowel
We do have "issement" and it is in R1 (which is -aissement), but it is
preceded by a vowel, so shouldn't be deleted.
So am I right in concluding that the "issement" rule should be tried
first (because "issement" is a longer suffix than "ement"), and that
once it has matched it should preclude the "ement" rule, even though
its own condition fails and nothing is deleted?
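To make the question concrete, here is a rough Python sketch of the behaviour I am guessing at: the longest matching suffix is selected first, and only that suffix's condition is tested, with no fallback to a shorter suffix. The function step1_fragment is hypothetical and covers only the two suffix groups discussed here; it ignores whatever else step 1 does after a deletion, so it is an illustration of the matching order, not an implementation of step 1.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")  # assumed French vowel list

def rv_start(word):
    # RV as quoted in point 1 of this message.
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    for i in range(1, len(word)):
        if word[i] in VOWELS:
            return i + 1
    return len(word)

def r1_start(word):
    # R1: the region after the first non-vowel following a vowel.
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def step1_fragment(word):
    """Hypothetical fragment of step 1: longest suffix wins, no fallback."""
    rv, r1 = rv_start(word), r1_start(word)
    suffixes = ["ement", "ements", "issement", "issements"]
    for suf in sorted(suffixes, key=len, reverse=True):
        if not word.endswith(suf):
            continue
        start = len(word) - len(suf)
        if suf.startswith("iss"):
            # delete if in R1 and preceded by a non-vowel
            ok = start >= r1 and start > 0 and word[start - 1] not in VOWELS
        else:
            # delete if in RV
            ok = start >= rv
        # The longest match decides the outcome, whether or not it deletes.
        return word[:start] if ok else word
    return word

print(step1_fragment("abaissement"))  # abaissement
```

Under this reading "issement" matches first, its condition fails because the preceding letter is the vowel "a", and the word is left alone, which agrees with the test data.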
All help appreciated,
Martin