[Snowball-discuss] Another puzzling French stemmer question

Martin Holmes mholmes at uvic.ca
Tue Jan 26 21:43:53 GMT 2021


Hi all,

I'm having trouble with the word égoïsme. The test data says that it 
should be stemmed to égoïsm, but I get égo. Here's the process:

Preflight: Replace ë and ï with He and Hi.
Result: égoHisme

Calculating RV, R1, R2:

RV: "If the word begins with two vowels, RV is the region after the 
third letter, otherwise the region after the first vowel not at the 
beginning of the word..."

Result: RV = Hisme

R1: "R1 is the region after the first non-vowel following a vowel..."

Result: R1 = oHisme

R2: "R2 is the region after the first non-vowel following a vowel in R1..."

Result: R2 = isme
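
The region calculation above can be reproduced with a short Python sketch. This is not the Snowball implementation. It follows only the parts of the definitions quoted here, and assumes that the vowels are a e i o u y â à ë é ê è ï î ô û ù, that the marker H counts as a non-vowel, and that a region is empty when its start position cannot be found. The function names are made up for illustration, and anything the quoted RV rule elides after the "..." is left out.

```python
import unicodedata

# Assumed vowel set; the marker H is deliberately not in it.
VOWELS = set(unicodedata.normalize("NFC", "aeiouyâàëéêèïîôûù"))


def after_nonvowel_following_vowel(word, start):
    """Index just past the first non-vowel that follows a vowel, scanning
    word[start:]. Returns len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def rv_start(word):
    """Start of RV, using only the part of the rule quoted in the post."""
    if len(word) >= 2 and word[0] in VOWELS and word[1] in VOWELS:
        return min(3, len(word))          # region after the third letter
    for i in range(1, len(word)):         # first vowel not at the beginning
        if word[i] in VOWELS:
            return i + 1
    return len(word)                      # assumption: empty region otherwise


def regions(word):
    word = unicodedata.normalize("NFC", word)
    r1 = after_nonvowel_following_vowel(word, 0)
    r2 = after_nonvowel_following_vowel(word, r1)   # same rule, applied inside R1
    return {"RV": word[rv_start(word):], "R1": word[r1:], "R2": word[r2:]}


print(regions("égoHisme"))
# {'RV': 'Hisme', 'R1': 'oHisme', 'R2': 'isme'}
```

Under those assumptions the sketch gives the same three regions as the hand calculation, so the question is whether one of the assumptions is wrong.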

Step 1: "Search for the longest among the following suffixes, and perform 
the action indicated."

ance   iqUe   isme   able   iste   eux
ances   iqUes   ismes   ables   istes
     delete if in R2

"isme" is the longest matching suffix and lies entirely within R2, so 
the rule deletes it and I end up with égo.

What am I misunderstanding here?

All help appreciated,
Martin



