[Snowball-discuss] French stemmer test data wrong?

Olly Betts olly at survex.com
Fri Jan 22 04:15:14 GMT 2021


Hi Martin,

On Thu, Jan 21, 2021 at 11:04:04AM -0800, Martin Holmes wrote:
> 2. In step 1, we have:
> 
>    ement   ements
>     delete if in RV
> 
> Clearly "ement" is in RV, so it should be deleted.
> 
> The only possibility I can see is that further down in step 1, we have this:
> 
> issement   issements
>     delete if in R1 and preceded by a non-vowel
> 
> We do have "issement" and it is in R1 (which is -aissement), but it is
> preceded by a vowel, so shouldn't be deleted.
> 
> So am I right in concluding that this particular test should fire first
> (because "issement" is longer than "ement"), and should then preclude the
> earlier match?

That's right - the intention is that we find "the longest among the
following suffixes" (which is "issement") "and perform the action
indicated" (which is "delete if in R1 and preceded by a non-vowel").
The condition in the action isn't satisfied which means nothing is
changed, but the action still "counts" - we don't then go on to look for
the next longest suffix and run its action as well.
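
To make that concrete, here is a rough Python sketch of the selection logic. It is only an illustration, not the actual Snowball code: it covers just the two step 1 rules quoted above (the full "ement" rule does more than this), the function name and rule table are made up for the example, and R1 and RV are passed in as start offsets rather than computed.

```python
VOWELS = set("aeiouyâàëéêèïîôûù")

# Simplified versions of the two rules quoted above (an assumption made
# for illustration): suffix -> (region, must be preceded by a non-vowel).
RULES = {
    "ement": ("RV", False),
    "ements": ("RV", False),
    "issement": ("R1", True),
    "issements": ("R1", True),
}


def apply_longest_suffix(word, r1_start, rv_start):
    """Apply the action of the longest matching suffix, and only that one."""
    # Choose the longest suffix purely by matching the end of the word;
    # the conditions are not consulted at this stage.
    matches = [s for s in RULES if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)

    region, needs_non_vowel = RULES[suffix]
    cut = len(word) - len(suffix)
    region_start = r1_start if region == "R1" else rv_start

    in_region = cut >= region_start
    preceded_ok = (not needs_non_vowel) or (cut > 0 and word[cut - 1] not in VOWELS)

    if in_region and preceded_ok:
        return word[:cut]
    # The condition failed: nothing changes, and we do NOT fall back to
    # a shorter suffix such as "ement".
    return word


# "abaissement" is a guess at a word that fits the description
# (R1 = "aissement", so R1 starts at offset 2; RV starts at offset 3).
print(apply_longest_suffix("abaissement", r1_start=2, rv_start=3))
```

With those offsets "issement" is chosen as the longest match, it is preceded by the vowel "a", so the word comes back unchanged even though "ement" on its own would have been in RV.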

It would be good to improve the description if it's not clear to people,
as it ideally ought to be possible to implement the algorithm from
scratch just from the description (though having implemented a couple of
stemmers myself from other people's descriptions, I know this can be
tricky).

Which part were you misled by here - was it that "the following
suffixes" means all those listed in step 1, or that a no-effect
conditional action still counts as performed?  Or was it both?

Cheers,
    Olly


