[Snowball-discuss] French stemmer test data wrong?

Martin Holmes mholmes at uvic.ca
Fri Jan 22 15:17:52 GMT 2021


Hi Olly,

I have implemented the English stemmer in XSLT and JS before, working 
from the descriptions (I don't read Snowball :-)), but I don't recall it 
having a long sequence of different sets of endings that first need to 
be triaged by length. I may be wrong (must go and look), but IIRC the 
length triage there only happens within each set of suffixes that has a 
specific action attached to it.

It might be clearer to start with a complete list of those suffixes, in 
descending order of length (to make creating regexes easier for the 
user) right at the top, with the message: "Find the longest of these 
suffixes, then for that suffix, look up the appropriate action below; 
act on only one suffix, then proceed to Step 2", followed by what's 
there now.
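
For illustration, the "longest suffix wins, and only its action runs" rule 
could be sketched roughly as below. This is a minimal Python sketch, not 
the Snowball implementation: the table holds only the two suffix groups 
quoted in this thread (the real step 1 has many more, and its "ement" 
action does more than a plain delete), the function names are made up for 
the example, and the R1 and RV start offsets are assumed to have been 
computed elsewhere and are simply passed in.

```python
# Abbreviated, hypothetical step 1 table: only the suffixes quoted in this
# thread. Region offsets r1 and rv are assumed to be computed elsewhere.
VOWELS = set("aeiouyâàëéêèïîôûù")


def delete_if_in_rv(word, suffix, r1, rv):
    """Delete the suffix if it lies entirely inside RV."""
    start = len(word) - len(suffix)
    return word[:start] if start >= rv else word


def delete_if_in_r1_after_non_vowel(word, suffix, r1, rv):
    """Delete the suffix if it is in R1 and preceded by a non-vowel."""
    start = len(word) - len(suffix)
    if start >= r1 and start > 0 and word[start - 1] not in VOWELS:
        return word[:start]
    return word  # condition failed: nothing changes


ACTIONS = {
    "ement": delete_if_in_rv,
    "ements": delete_if_in_rv,
    "issement": delete_if_in_r1_after_non_vowel,
    "issements": delete_if_in_r1_after_non_vowel,
}

# One flat list across all groups, longest first.
SUFFIXES_LONGEST_FIRST = sorted(ACTIONS, key=len, reverse=True)


def step1(word, r1, rv):
    """Apply the action of the longest matching suffix, and only that one."""
    for suffix in SUFFIXES_LONGEST_FIRST:
        if word.endswith(suffix):
            # The action "counts" even if its condition is not satisfied,
            # so we return here and never fall through to a shorter suffix.
            return ACTIONS[suffix](word, suffix, r1, rv)
    return word


# Illustrative word with hand-supplied offsets (assumptions for the example):
# R1 starts at index 2 ("aissement"), RV at index 3 ("issement").
print(step1("abaissement", r1=2, rv=3))  # abaissement
```

Here "issement" is the longest match, its condition fails because the 
preceding letter is a vowel, and the word comes out of this step 
unchanged. Trying "ement" first would instead have deleted it.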

Thanks for clarifying -- I'm moving forward now. 798 tests passing so far...

Cheers,
Martin

On 2021-01-21 8:15 p.m., Olly Betts wrote:
> Hi Martin,
> 
> On Thu, Jan 21, 2021 at 11:04:04AM -0800, Martin Holmes wrote:
>> 2. In step 1, we have:
>>
>>     ement   ements
>>      delete if in RV
>>
>> Clearly "ement" is in RV, so it should be deleted.
>>
>> The only possibility I can see is that further down in step 1, we have this:
>>
>> issement   issements
>>      delete if in R1 and preceded by a non-vowel
>>
>> We do have "issement" and it is in R1 (which is -aissement), but it is
>> preceded by a vowel, so shouldn't be deleted.
>>
>> So am I right in concluding that this particular test should fire first
>> (because "issement" is longer than "ement"), and should then preclude the
>> earlier match?
> 
> That's right - the intention is that we find "the longest among the
> following suffixes" (which is "issement") "and perform the action
> indicated" (which is "delete if in R1 and preceded by a non-vowel").
> The condition in the action isn't satisfied which means nothing is
> changed, but the action still "counts" - we don't then go on to look for
> the next longest suffix and run its action as well.
> 
> It would be good to improve the description if it's not clear to people,
> as it ideally ought to be possible to implement the algorithm from
> scratch just from the description (though having implemented a couple of
> stemmers myself from other people's descriptions, I know this can be
> tricky).
> 
> Which part were you misled by here - was it that "the following
> suffixes" means all those listed in step 1, or that a no-effect
> conditional action still counts as performed?  Or was it both?
> 
> Cheers,
>      Olly
> 

-- 
-------------------------------------
Humanities Computing and Media Centre
University of Victoria
mholmes at uvic.ca


