[Snowball-discuss] Stemming Romanian with s&t with comma (vs cedilla)
Trey Jones
tjones at wikimedia.org
Tue Feb 28 23:04:52 GMT 2023
Hi All,
I work on the Search Platform Team at the Wikimedia Foundation and I've
been looking at Romanian Wikipedia data and working with the Romanian
Elasticsearch analysis chain, which uses Lucene analyzers, which use the
Snowball stemmer.
Historically, poor software, OS, and font support made s and t with cedilla
(ş, ţ) a practical choice for many years, but s and t with comma (ș, ț) are
generally well supported now, and seem to have been since at least 2010. See
more at Wikipedia:
https://en.wikipedia.org/wiki/Romanian_alphabet#Comma-below_(%C8%99_and_%C8%9B)_versus_cedilla_(%C5%9F_and_%C5%A3)
I recently discovered that the Lucene stopword list uses the cedilla forms,
and the following discussion on GitHub revealed that the Snowball stemmer
uses the cedilla forms, too. At least on Romanian Wikipedia, the correct
comma forms are much more common, though both are present.
For example, şi ("and", with cedilla) occurs in 4,976 documents[1],
while și ("and", with comma) occurs in 288,999 documents[2]. I haven't yet
finished my analysis, but I expect conflating ș and ş will have a big
impact on stemming on Wikipedia. While I can patch our implementation, and
I think the Lucene folks will patch theirs, it'd be great if the underlying
Snowball stemmer could handle modern Romanian text correctly, too.
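A minimal sketch of the kind of downstream patch mentioned above, in Python.
It assumes the workaround is to fold the comma forms to the cedilla forms the
stemmer currently expects before stemming; the helper name
fold_comma_to_cedilla is hypothetical and is not part of Snowball, Lucene, or
Elasticsearch. It only illustrates the character mapping involved.

```python
# Hypothetical pre-stemming filter (an assumption, not an existing API):
# map Romanian comma-below letters to their cedilla counterparts so that a
# stemmer and stopword list written for the cedilla forms still match.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # S with comma below -> S with cedilla
    "\u0219": "\u015F",  # s with comma below -> s with cedilla
    "\u021A": "\u0162",  # T with comma below -> T with cedilla
    "\u021B": "\u0163",  # t with comma below -> t with cedilla
})


def fold_comma_to_cedilla(text: str) -> str:
    """Return text with comma-below s/t replaced by the cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)
```

With a filter like this in front of the stemmer, both spellings of "și" reach
it as the same token. Mapping in the other direction (cedilla to comma) would
conflate the forms just as well, but would only help once the stemmer and the
stopword list recognise the comma forms.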
I'd make a pull request on GitHub, but I don't speak Snowball well enough
to be sure how to do it. But in general it looks like if variables s and t
on lines 26 and 27 here:
https://github.com/snowballstem/snowball/blob/master/algorithms/romanian.sbl#L26-L27
could be expanded to include {U+0219} (s with comma) and {U+021B} (t with
comma), respectively, that might do it.
Thanks for all the work that's gone into the Snowball stemmers over the
years!
—Trey
[1]
https://ro.wikipedia.org/w/index.php?search=%22%C5%9Fi%22&title=Special:C%C4%83utare&profile=default&fulltext=1
[2]
https://ro.wikipedia.org/w/index.php?search=%22%C8%99i%22&title=Special:C%C4%83utare&profile=default&fulltext=1
Trey Jones
Staff Computational Linguist, Search Platform
Wikimedia Foundation
UTC–5 / EST