[Snowball-discuss] Stemming Romanian with s&t with comma (vs cedilla)

Trey Jones tjones at wikimedia.org
Thu Mar 9 15:51:39 GMT 2023


Thanks for the reply, Martin.

Since this issue started downstream of Snowball, there have been other
discussions not on this list. Those discussions led to Robert Muir
submitting pull requests to snowball (#177) and snowball-data (#23) on
GitHub. He's opted to map to the comma forms, which I agree with; obviously
it could be wired the other way around if necessary.
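
For anyone who needs the same behaviour before an updated stemmer is available, here is a minimal sketch of the mapping as a pre-processing step. It is not the code from Robert's pull requests, only an illustration of the direction he chose (cedilla to comma); the names CEDILLA_TO_COMMA and normalise_romanian are made up for this example, and the upper-case pairs are included on the assumption that the text has not been lower-cased yet.

```python
# Hypothetical pre-normalisation step (names are illustrative, not Snowball's).
# Maps the legacy cedilla forms of s and t to the comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def normalise_romanian(text: str) -> str:
    """Return text with cedilla s/t replaced by comma-below s/t."""
    return text.translate(CEDILLA_TO_COMMA)
```

Wiring it the other way around would only mean swapping the keys and values in the table.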

Thanks for the pointer to Italian! That model would probably have helped me
write my own patch, but Robert already has a handle on it, so I'm happy to
let him take care of it.

—Trey
Trey Jones
Staff Computational Linguist, Search Platform
Wikimedia Foundation
UTC–5 / EST


On Wed, Mar 8, 2023 at 7:14 PM Martin Porter <martin.f.porter at gmail.com>
wrote:

> Trey,
>
> Thank you for pointing this out. (Sorry for the late reply.) When I
> wrote the Romanian stemmer in 2007, the data I used certainly had s&t
> with cedilla, and I was not aware of the s&t with comma form.
>
> Some normalisation of the text strings given to the stemmers is
> expected, and most especially the mapping of upper case letters to
> lower case. But sometimes a bit of normalisation is done in the
> stemmer. I think that is what should happen here.
>
> Have a look at the 'prelude' routine in the Italian stemmer, and
> compare with the 'prelude' routine in Romanian. In Italian, the rules
> for the direction (acute or grave) of accents over vowels were quite
> relaxed, and varied from one publishing house to another. Perhaps the
> rules are tighter now, but there was no harm in mapping the acutes to
> graves, which is done in the first repeat loop. We put the same idea
> in the Romanian stemmer.
>
> The question is, do we map cedilla to comma-under, or comma-under to
> cedilla? Since comma-under is becoming standard, it would seem sensible
> to map cedilla to comma-under, but even so, I would suggest doing it
> the other way round, so that users who have had to rely on the cedilla
> representation will not notice any change. The stemmed forms will have
> the cedilla form of s&t, but these should be hidden from view anyway.
>
> Olly Betts manages github Snowball, and we must see what he thinks.
>
> I could write the necessary patch, but it would take me some time to
> test it out.
>
> Martin
>