[Snowball-discuss] French stemmer test data wrong?
Martin Holmes
mholmes at uvic.ca
Fri Jan 22 23:02:26 GMT 2021
I've now been able to confirm my supposition below: in steps 2a and 2b,
only RV is searched for a match. This differs from step 1, where the
longest matching suffix is first found in the whole word, and nothing
may then happen because that suffix turns out not to lie in the region
the rule requires (RV, R1 or R2).
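A minimal Python sketch of the two matching strategies, for anyone who wants to see the difference on "alliez". This is not the XSLT implementation and not the official Snowball code: the function names are invented for illustration, the RV rule follows only the definition quoted further down in this thread, and the suffix list is a two-item subset of step 2b.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

def rv_start(word):
    """Index at which RV begins, per the definition quoted in this thread."""
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))
    # Otherwise: the region after the first vowel not at the start of the word.
    for i in range(1, len(word)):
        if word[i] in VOWELS:
            return i + 1
    return len(word)

def longest_suffix(text, suffixes):
    """Longest entry of `suffixes` that `text` ends with, or None."""
    hits = [s for s in suffixes if text.endswith(s)]
    return max(hits, key=len) if hits else None

def match_in_word_then_check_rv(word, suffixes):
    """Step 1 style: find the longest suffix in the whole word, then test the region."""
    s = longest_suffix(word, suffixes)
    if s is not None and len(word) - len(s) >= rv_start(word):
        return word[:-len(s)]
    return word

def match_within_rv(word, suffixes):
    """Step 2a/2b style: only the RV region itself is searched."""
    s = longest_suffix(word[rv_start(word):], suffixes)
    return word[:-len(s)] if s is not None else word

# Hypothetical cut-down suffix list (both entries do appear in step 2b).
SUFFIXES = ["iez", "ez"]

print(match_in_word_then_check_rv("alliez", SUFFIXES))  # iez is found but is not in RV
print(match_within_rv("alliez", SUFFIXES))              # ez is found inside RV and deleted
```

With the first strategy "alliez" is left alone, because the longest match "iez" starts before RV ("ez"); with the second, only "ez" is visible to the matcher, so it is deleted and the result agrees with the test data.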
With this, I've now managed to get the whole test set to stem
correctly, which is a great relief. There's one more thing that needs
to be clarified, though, and it took a little working out:
The replacements that take place in the prelude need to be done in
precisely the right order, otherwise the algorithm fails. This is the
XPath sequence I've ended up with (the innermost replace() runs first),
after having to switch things around a few times till everything worked:
replace(
  replace(
    replace(
      replace(
        replace(
          replace(
            replace($token,
              'y(' || $vowel || ')', 'Y$1'),
            '(' || $vowel || ')y', '$1Y'),
          '(' || $vowel || ')u(' || $vowel || ')', '$1U$2'),
        'qu', 'qU'),
      '(' || $vowel || ')i(' || $vowel || ')', '$1I$2'),
    'ë', 'He'),
  'ï', 'Hi')
The key point is that since (for example) 'y' is a vowel but Y is not,
changing y to Y affects the results of subsequent replacements, which
are based on vowels in the surrounding context. If (for example) i is
switched to I before y to Y, then
croyiez
becomes
croYIez
and the algorithm gives the wrong result. But if Y is switched first,
then it is
croYiez
and the correct result pops out.
The prose description should probably explain that the order of
operations is important here.
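The ordering effect can be reproduced outside XSLT. The following is a rough Python transcription of the replace() chain above, offered as a sketch only: it assumes $vowel is a character class containing the French stemmer's vowels, and the names PRELUDE_STEPS, I_FIRST and apply_steps are made up for this example.

```python
import re

# Assumed content of $vowel: a regex character class of the stemmer's vowels.
VOWEL = "[aeiouyâàëéêèïîôûù]"

# (pattern, replacement) pairs in the order the nested replace() calls
# are evaluated, innermost first.
PRELUDE_STEPS = [
    ("y(" + VOWEL + ")", r"Y\1"),
    ("(" + VOWEL + ")y", r"\1Y"),
    ("(" + VOWEL + ")u(" + VOWEL + ")", r"\1U\2"),
    ("qu", "qU"),
    ("(" + VOWEL + ")i(" + VOWEL + ")", r"\1I\2"),
    ("ë", "He"),
    ("ï", "Hi"),
]

# The same steps with the i rule moved to the front, to show the failure.
I_FIRST = [PRELUDE_STEPS[4]] + PRELUDE_STEPS[:4] + PRELUDE_STEPS[5:]

def apply_steps(token, steps):
    """Apply each regex replacement in turn; matching is case-sensitive."""
    for pattern, replacement in steps:
        token = re.sub(pattern, replacement, token)
    return token

print(apply_steps("croyiez", PRELUDE_STEPS))  # croYiez
print(apply_steps("croyiez", I_FIRST))        # croYIez
```

In the second call the i in "yie" still has a lower-case y (a vowel) on its left when the i rule runs, so it is upper-cased; once y has become Y first, that context no longer matches.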
Cheers,
Martin
On 2021-01-22 11:43 a.m., Martin Holmes wrote:
> Olly has very kindly solved this problem for me -- you do in fact triage
> the entire set of suffixes by length initially, and act on the longest
> one found.
>
> I've now hit another problem that I really can't figure out. According
> to the test data, "alliez" should be stemmed to "alli". However, it
> comes out unchanged as "alliez" if we follow the rules, as far as I can
> see:
>
> "If the word begins with two vowels, RV is the region after the third
> letter, otherwise the region after the first vowel not at the beginning
> of the word, or the end of the word if these positions cannot be found."
>
> "alliez" does not begin with two vowels, so RV should be the region
> after the first vowel not at the beginning of the word:
>
> ez
>
> Step 1 does nothing, so step 2a runs; that does nothing, so step 2b runs.
>
> "In steps 2a and 2b all tests are confined to the RV region."
>
> The longest matching suffix in step 2b is "iez". However, this is not
> inside RV, which is -ez, so nothing should be done here either.
>
> No subsequent step triggers any change.
>
> The only thing I can imagine here is that the intention is not:
>
> - search for the longest suffix in the word and check whether it's in RV
>
> but rather:
>
> - search IN RV for the longest matching suffix
>
> in which case -ez would match instead, and would be deleted.
>
> Am I right in this assumption?
>
> Thanks,
> Martin Holmes
>
> On 2021-01-21 11:04 a.m., Martin Holmes wrote:
>> Hi all,
>>
>> I'm working on a French stemmer in XSLT, and stumbling over the word
>> abaissement.
>>
>> According to the test data, this word should not change during stemming:
>>
>> <https://raw.githubusercontent.com/snowballstem/snowball-data/master/french/voc.txt>
>>
>> <https://raw.githubusercontent.com/snowballstem/snowball-data/master/french/output.txt>
>>
>>
>> (It's line 8 in the input and output.)
>>
>> It seems odd that it shouldn't change, but as we know stemming
>> algorithms aren't perfect. However, it does get stemmed to abaiss in
>> my stemmer. If I follow the process:
>>
>> 1. Defining RV: "If the word begins with two vowels, RV is the region
>> after the third letter, otherwise the region after the first vowel not
>> at the beginning of the word, or the end of the word if these
>> positions cannot be found."
>>
>> So for abaissement, this should be the region after the first vowel
>> not at the beginning of the word:
>>
>> aba - issement
>>
>> 2. In step 1, we have:
>>
>> ement ements
>> delete if in RV
>>
>> Clearly "ement" is in RV, so it should be deleted.
>>
>> The only possibility I can see is that further down in step 1, we have
>> this:
>>
>> issement issements
>> delete if in R1 and preceded by a non-vowel
>>
>> We do have "issement" and it is in R1 (which is -aissement), but it is
>> preceded by a vowel, so shouldn't be deleted.
>>
>> So am I right in concluding that this particular test should fire
>> first (because "issement" is longer than "ement"), and should then
>> preclude the earlier match?
>>
>> All help appreciated,
>> Martin
>
>
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss at lists.tartarus.org
> https://lists.tartarus.org/mailman/listinfo/snowball-discuss