[Snowball-discuss] French stemmer test data wrong?
Martin Holmes
mholmes at uvic.ca
Fri Jan 22 15:17:52 GMT 2021
Hi Olly,
I have implemented the English stemmer in XSLT and JS before, working
from the descriptions (I don't read Snowball :-)), but I don't recall it
having a long sequence of different sets of endings which first need to
be triaged by length. I may be wrong (must go and look), but IIRC the
length triage there only happens within each set of suffixes which has a
specific action attached to it.
It might be clearer to start with a complete list of those suffixes, in
descending order of length (to make creating regexes easier for the
user) right at the top, with the message: "Find the longest of these
suffixes, then for that suffix, look up the appropriate action below;
act on only one suffix, then proceed to Step 2", followed by what's
there now.
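
To make the proposed wording concrete, here is a minimal Python sketch of
"find the longest suffix, run only its action, then stop". It is an
illustration under assumptions, not the real step 1. It covers only the
two rules quoted in this thread. The function names, the vowel set, and
the example word "abaissement" are hypothetical choices for illustration.
The R1 and RV start offsets are passed in by hand rather than computed.

```python
# Sketch only: two of the step 1 rules, with R1/RV offsets supplied by the caller.
VOWELS = set("aeiouyâàëéêèïîôûù")  # assumed vowel set, for illustration


def delete_if_in_rv(word, suffix, r1, rv):
    """Delete the suffix if it lies entirely inside RV."""
    start = len(word) - len(suffix)
    return word[:start] if start >= rv else word


def delete_if_in_r1_after_non_vowel(word, suffix, r1, rv):
    """Delete the suffix if it lies in R1 and is preceded by a non-vowel."""
    start = len(word) - len(suffix)
    if start >= r1 and start > 0 and word[start - 1] not in VOWELS:
        return word[:start]
    return word


# Each suffix maps to the action of the group it belongs to.
ACTIONS = {
    "ement": delete_if_in_rv,
    "ements": delete_if_in_rv,
    "issement": delete_if_in_r1_after_non_vowel,
    "issements": delete_if_in_r1_after_non_vowel,
}

# One complete list, longest first, so the first match is the longest match.
SUFFIXES = sorted(ACTIONS, key=len, reverse=True)


def step1(word, r1, rv):
    """Apply the action of the longest matching suffix, and only that one."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            # The action "counts" even when its condition fails and the
            # word comes back unchanged: no shorter suffix is tried.
            return ACTIONS[suffix](word, suffix, r1, rv)
    return word


# Hypothetical word whose R1 is "aissement"; offsets are assumed by hand.
print(step1("abaissement", r1=2, rv=3))
# Misreading: applying the shorter "ement" rule on its own.
print(delete_if_in_rv("abaissement", "ement", 2, 3))
```

With these assumed offsets, step1 leaves the word alone because
"issement" is the longest match and its condition fails, whereas applying
the "ement" rule on its own would strip the word down to "abaiss".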
Thanks for clarifying -- I'm moving forward now. 798 tests passing so far...
Cheers,
Martin
On 2021-01-21 8:15 p.m., Olly Betts wrote:
> Hi Martin,
>
> On Thu, Jan 21, 2021 at 11:04:04AM -0800, Martin Holmes wrote:
>> 2. In step 1, we have:
>>
>> ement ements
>> delete if in RV
>>
>> Clearly "ement" is in RV, so it should be deleted.
>>
>> The only possibility I can see is that further down in step 1, we have this:
>>
>> issement issements
>> delete if in R1 and preceded by a non-vowel
>>
>> We do have "issement" and it is in R1 (which is -aissement), but it is
>> preceded by a vowel, so shouldn't be deleted.
>>
>> So am I right in concluding that this particular test should fire first
>> (because "issement" is longer than "ement"), and should then preclude the
>> earlier match?
>
> That's right - the intention is that we find "the longest among the
> following suffixes" (which is "issement") "and perform the action
> indicated" (which is "delete if in R1 and preceded by a non-vowel").
> The condition in the action isn't satisfied which means nothing is
> changed, but the action still "counts" - we don't then go on to look for
> the next longest suffix and run its action as well.
>
> It would be good to improve the description if it's not clear to people,
> as it ideally ought to be possible to implement the algorithm from
> scratch just from the description (though having implemented a couple of
> stemmers myself from other people's descriptions, I know this can be
> tricky.)
>
> Which part were you misled by here - was it that "the following
> suffixes" means all those listed in step 1, or that a no-effect
> conditional action still counts as performed? Or was it both?
>
> Cheers,
> Olly
>
--
-------------------------------------
Humanities Computing and Media Centre
University of Victoria
mholmes at uvic.ca