[Snowball-discuss] French stemmer test data wrong?
Martin Porter
martin.f.porter at gmail.com
Sun Jan 24 20:48:48 GMT 2021
But in the definition of the French stemmer at snowballstem.org, I was
surprised by this,
"Replace ë and ï with He and Hi. The H marks the vowel as having
originally had a diaeresis, while the vowel itself, lacking an accent,
is able to match suffixes beginning in e or i."
and its subsequent ramifications. This is an addition to the original
stemmer. Surely the reason given is not the correct one: the 'H' adds
to the syllable length of a word, which in turn adjusts regions R1 and R2. One
sees it in English with
coop (one syllable)
coöp (two syllables)
coöp being an oldish spelling for co-op, short for cooperative. In the
test vocabulary at snowballstem.org, these are the only words
containing ë,
aiguë
aiguës
ambiguë
ambiguës
arbërisht
canoë
canoës
ciguë
contiguë
contiguës
exiguë
exiguës
gaëlique
israël
maërl
moëlle
noël
noëls
raëlien
raphaëlois
staël
subaiguë
about one word per 1,000. There are no suffixes here with ë in place
of e, nor is adjusting R1 and R2 going to make any important
difference to the stemmer's performance. Where did this idea come
from?
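
To make the point about R1 and R2 concrete, here is a rough Python sketch,
not the Snowball source. It assumes the usual Snowball definition of R1 (the
region after the first non-vowel following a vowel) and of R2 (the same rule
applied inside R1), and the French vowel list from the stemmer definition.
The function names are my own, and the sketch ignores the French special
cases for R1 (par, col, tap) and the upper-casing of u, i and y.

```python
# Vowels as listed in the French stemmer definition; 'H' is not among
# them, so the marker behaves as a non-vowel when regions are computed.
VOWELS = set("aeiouyâàëéêèïîôûù")


def mark_diaeresis(word):
    """Replace ë and ï with He and Hi, as the quoted rule describes."""
    return word.replace("ë", "He").replace("ï", "Hi")


def region_start(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Return the R1 and R2 regions of `word` as strings."""
    r1 = region_start(word)
    r2 = region_start(word, r1)
    return word[r1:], word[r2:]


for w in ("noël", "aiguë"):
    marked = mark_diaeresis(w)
    print(w, r1_r2(w), marked, r1_r2(marked))
```

Under these assumptions noël has an empty R1, while noHel has R1 = "el";
aiguë has an empty R2, while aiguHe has R2 = "e". So the H does move the
regions, whatever the stated reason for inserting it.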