<div dir="ltr">Thanks for the reply, Martin.<div><br></div><div>Since this issue started downstream of Snowball, there have been other discussions not on this list. Those discussions led to Robert Muir submitting pull requests to snowball (#177) and snowball-data (#23) on GitHub. He's opted to map to the comma forms, which I agree with; obviously it could be wired the other way around if necessary.</div><div><br></div><div>Thanks for the pointer to Italian! That was the model that probably would have helped me write my own patch, but Robert already had a handle on it so I'm happy to let him take care of it.</div><div><br></div><div>—Trey</div><div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div><div style="font-size:small"><font color="#999999"><span style="font-size:12.8px">Trey Jones</span><br></font></div><span style="font-size:12.8px"><font color="#999999">Staff Computational Linguist, Search Platform<br>Wikimedia Foundation</font></span><br></div><div><span style="font-size:12.8px"><font color="#999999">UTC–5 / EST</font></span></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Mar 8, 2023 at 7:14 PM Martin Porter <<a href="mailto:martin.f.porter@gmail.com">martin.f.porter@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Trey,<br>
<br>
Thank you for pointing this out. (Sorry for the late reply.) When I<br>
wrote the Romanian stemmer in 2007, the data I used certainly had s&amp;t<br>
with cedilla, and I was not aware of the s&amp;t with comma form.<br>
<br>
Some normalisation of the text strings given to the stemmers is<br>
expected, and most especially the mapping of upper case letters to<br>
lower case. But sometimes a bit of normalisation is done in the<br>
stemmer. I think that is what should happen here.<br>
<br>
Have a look at the 'prelude' routine in the Italian stemmer, and<br>
compare with the 'prelude' routine in Romanian. In Italian, the rules<br>
for the direction (acute or grave) of accents over vowels were quite<br>
relaxed, and varied from one publishing house to another. Perhaps the<br>
rules are tighter now, but there was no harm in mapping the acutes to<br>
graves, which is done in the first repeat loop. We put the same idea<br>
in the Romanian stemmer.<br>
<br>
The question is, do we map cedilla to comma-under, or comma-under to<br>
cedilla? Since comma-under is becoming standard, it would seem sensible<br>
to map cedilla to comma-under, but even so, I would suggest doing it<br>
the other way round, so that users who have had to rely on the cedilla<br>
representation will not notice any change. The stemmed forms will have<br>
the cedilla form of s&amp;t, but these should be hidden from view anyway.<br>
<br>
Olly Betts manages Snowball on GitHub, and we must see what he thinks.<br>
<br>
I could write the necessary patch, but it would take me some time to<br>
test it out.<br>
<br>
Martin<br>
</blockquote></div>
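<br>
For readers following the thread, the normalisation discussed above can be sketched outside the stemmer as a simple character mapping. This is only an illustration in Python, not the Snowball patch itself: the names PAIRS, CEDILLA_TO_COMMA, COMMA_TO_CEDILLA and normalize_romanian are hypothetical, and the sketch assumes the only characters involved are the four Unicode pairs for S and T with cedilla versus comma below.

```python
# Illustrative sketch only; these names are hypothetical and not part of Snowball.
# Each pair is (cedilla form, comma-below form).
PAIRS = {
    "\u015e": "\u0218",  # capital S: cedilla -> comma below
    "\u015f": "\u0219",  # small s: cedilla -> comma below
    "\u0162": "\u021a",  # capital T: cedilla -> comma below
    "\u0163": "\u021b",  # small t: cedilla -> comma below
}

# Translation tables for both directions, since the thread discusses both.
CEDILLA_TO_COMMA = str.maketrans(PAIRS)
COMMA_TO_CEDILLA = str.maketrans({comma: cedilla for cedilla, comma in PAIRS.items()})


def normalize_romanian(text, to_comma=True):
    """Map s/t cedilla forms to comma-below forms, or the reverse."""
    table = CEDILLA_TO_COMMA if to_comma else COMMA_TO_CEDILLA
    return text.translate(table)
```

Whichever direction is chosen, the point is that both spellings of a word reach the stemming rules in a single form, so they produce the same stem.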