[Snowball-discuss] French stemmer test data wrong?
Martin Holmes
mholmes at uvic.ca
Fri Jan 22 19:43:56 GMT 2021
Olly has very kindly solved this problem for me -- the algorithm does in
fact rank the entire set of suffixes by length first, and acts on the
longest one found.
I've now hit another problem that I really can't figure out. According
to the test data, "alliez" should be stemmed to "alli". However, it
comes out unchanged as "alliez" if we follow the rules, as far as I can see:
"If the word begins with two vowels, RV is the region after the third
letter, otherwise the region after the first vowel not at the beginning
of the word, or the end of the word if these positions cannot be found."
"alliez" does not begin with two vowels, so RV should be the region
after the first vowel not at the beginning of the word:
alli - ez
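To make the reading of the rule concrete, here is a minimal sketch in Python (not my XSLT) of the RV definition exactly as quoted above. The names VOWELS, rv_start and rv are my own, the vowel list is the one I understand the French algorithm to use, and the sketch deliberately ignores any other preprocessing or exceptions in the full algorithm.

```python
# Vowel set assumed for the French algorithm (an assumption of this sketch).
VOWELS = set("aeiouyâàëéêèïîôûù")

def rv_start(word):
    """Index at which RV begins, following only the quoted sentence."""
    # Case 1: the word begins with two vowels -> RV starts after the third letter.
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    # Case 2: RV starts after the first vowel that is not the first letter.
    for i in range(1, len(word)):
        if word[i] in VOWELS:
            return i + 1
    # Case 3: no such position -> RV is empty (end of the word).
    return len(word)

def rv(word):
    return word[rv_start(word):]

print(rv("alliez"))       # ez
print(rv("abaissement"))  # issement
```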
Step 1 does nothing, so step 2a runs; that does nothing, so step 2b runs.
"In steps 2a and 2b all tests are confined to the RV region."
The longest matching suffix in step 2b is "iez". However, "iez" does not
lie entirely inside RV, which is -ez, so nothing should be done here either.
No subsequent step triggers any change.
The only thing I can imagine here is that the intention is not:
- search for the longest suffix in the word and check whether it's in RV
but rather:
- search IN RV for the longest matching suffix
in which case -ez would match instead, and would be deleted.
Am I right in this assumption?
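The two readings can be sketched side by side in Python. This is only an illustration: SUFFIXES is a two-item subset of the step 2b list, both function names are hypothetical, and the RV start index for "alliez" is passed in by hand as 4 (RV = "ez").

```python
# Illustrative subset of the step 2b suffixes; the real list is much longer.
SUFFIXES = ("iez", "ez")

def longest_in_word_then_check_rv(word, rv_start):
    """Reading 1: take the longest suffix of the whole word, then require
    that it lies inside RV; if it does not, do nothing."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf):
            if len(word) - len(suf) >= rv_start:
                return word[:-len(suf)]
            return word  # longest match is not in RV: no change
    return word

def longest_within_rv(word, rv_start):
    """Reading 2: search only inside RV for the longest matching suffix."""
    region = word[rv_start:]
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if region.endswith(suf):
            return word[:-len(suf)]
    return word

print(longest_in_word_then_check_rv("alliez", 4))  # alliez
print(longest_within_rv("alliez", 4))              # alli
```

Only the second reading reproduces the "alli" given in the test data.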
Thanks,
Martin Holmes
On 2021-01-21 11:04 a.m., Martin Holmes wrote:
> Hi all,
>
> I'm working on a French stemmer in XSLT, and stumbling over the word
> abaissement.
>
> According to the test data, this word should not change during stemming:
>
> <https://raw.githubusercontent.com/snowballstem/snowball-data/master/french/voc.txt>
>
> <https://raw.githubusercontent.com/snowballstem/snowball-data/master/french/output.txt>
>
>
> (It's line 8 in the input and output.)
>
> It seems odd that it shouldn't change, but as we know stemming
> algorithms aren't perfect. However, it does get stemmed to abaiss in my
> stemmer. If I follow the process:
>
> 1. Defining RV: "If the word begins with two vowels, RV is the region
> after the third letter, otherwise the region after the first vowel not
> at the beginning of the word, or the end of the word if these positions
> cannot be found."
>
> So for abaissement, this should be the region after the first vowel not
> at the beginning of the word:
>
> aba - issement
>
> 2. In step 1, we have:
>
> ement ements
> delete if in RV
>
> Clearly "ement" is in RV, so it should be deleted.
>
> The only possibility I can see is that further down in step 1, we have
> this:
>
> issement issements
> delete if in R1 and preceded by a non-vowel
>
> We do have "issement" and it is in R1 (which is -aissement), but it is
> preceded by a vowel, so shouldn't be deleted.
>
> So am I right in concluding that this particular test should fire first
> (because "issement" is longer than "ement"), and should then preclude
> the earlier match?
>
> All help appreciated,
> Martin
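
For the earlier abaissement question quoted above, the "longest suffix wins, then its condition is tested" behaviour can be sketched as follows. This is a simplified Python illustration, not the real step 1: it covers only the two rules discussed, the function names are hypothetical, and r1_start assumes the usual Snowball definition of R1 (the region after the first non-vowel following a vowel).

```python
VOWELS = set("aeiouyâàëéêèïîôûù")  # assumed French vowel set

def r1_start(word):
    # R1: region after the first non-vowel that follows a vowel.
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def rv_start(word):
    # RV as quoted in the message (no other exceptions handled).
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    for i in range(1, len(word)):
        if word[i] in VOWELS:
            return i + 1
    return len(word)

def step1_subset(word):
    """Pick the longest matching suffix first; only then test its condition."""
    suffixes = ("issements", "issement", "ements", "ement")
    match = max((s for s in suffixes if word.endswith(s)), key=len, default=None)
    if match is None:
        return word
    start = len(word) - len(match)
    if match.startswith("iss"):
        # issement(s): delete if in R1 and preceded by a non-vowel
        ok = start >= r1_start(word) and start > 0 and word[start - 1] not in VOWELS
    else:
        # ement(s): delete if in RV
        ok = start >= rv_start(word)
    # If the longest match fails its condition, shorter suffixes are not retried.
    return word[:start] if ok else word

print(step1_subset("abaissement"))  # abaissement
```

Here "issement" is selected, fails its condition because it is preceded by the vowel "a", and the shorter "ement" is never tried, which matches the unchanged word in the test data.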