[Snowball-discuss] Stemming Romanian with s&t with comma (vs cedilla)

Martin Porter martin.f.porter at gmail.com
Thu Mar 9 00:13:57 GMT 2023


Trey,

Thank you for pointing this out. (Sorry for the late reply.) When I
wrote the Romanian stemmer in 2007, the data I used certainly had s&t
with cedilla, and I was not aware of the s&t with comma form.

Some normalisation of the text strings given to the stemmers is
expected, and most especially the mapping of upper case letters to
lower case. But sometimes a bit of normalisation is done in the
stemmer. I think that is what should happen here.

Have a look at the 'prelude' routine in the Italian stemmer, and
compare with the 'prelude' routine in Romanian. In Italian, the rules
for the direction (acute or grave) of accents over vowels were quite
relaxed, and varied from one publishing house to another. Perhaps the
rules are tighter now, but there was no harm in mapping the acutes to
graves, which is done in the first repeat loop. We put the same idea
in the Romanian stemmer.
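
To make the acute-to-grave idea concrete, here is a rough sketch in
Python rather than Snowball. It is only an illustration of the mapping,
not the stemmer source, and the names ACUTE_TO_GRAVE and
map_acute_to_grave are made up for this example.

```python
# Hypothetical sketch of the acute-to-grave folding described above.
# Not the Snowball code; names are invented for illustration.
ACUTE_TO_GRAVE = str.maketrans(
    "\u00e1\u00e9\u00ed\u00f3\u00fa",  # a, e, i, o, u with acute accent
    "\u00e0\u00e8\u00ec\u00f2\u00f9",  # a, e, i, o, u with grave accent
)


def map_acute_to_grave(word: str) -> str:
    """Lowercase the word, then replace acute-accented vowels with grave ones."""
    return word.lower().translate(ACUTE_TO_GRAVE)
```

With this, two spellings of the same word that differ only in accent
direction reach the rest of the stemmer as one form.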

The question is, do we map cedilla to comma-under, or comma-under to
cedilla? Since comma-under is becoming standard, it would seem sensible
to map cedilla to comma-under, but even so, I would suggest doing it
the other way round, so that users who have had to rely on the cedilla
representation will not notice any change. The stemmed forms will have
the cedilla form of s&t, but these should be hidden from view anyway.
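
The comma-under to cedilla direction could look roughly like this,
again as a Python sketch and not a patch to the stemmer. The names
COMMA_TO_CEDILLA and fold_comma_to_cedilla are hypothetical, and the
lowercasing step stands in for the normalisation that is assumed to
happen before stemming.

```python
# Hypothetical sketch: fold the comma-below letters to the older
# cedilla letters before stemming. Not the Snowball code.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
})


def fold_comma_to_cedilla(word: str) -> str:
    """Lowercase the word, then map comma-below s and t to the cedilla forms."""
    return word.lower().translate(COMMA_TO_CEDILLA)
```

Input that already uses the cedilla letters passes through unchanged,
so existing users would see the same stems as before.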

Olly Betts manages github Snowball, and we must see what he thinks.

I could write the necessary patch, but it would take me some time to
test it out.

Martin


