[Snowball-discuss] Stemming Romanian with s&t with comma (vs cedilla)
Martin Porter
martin.f.porter at gmail.com
Wed Sep 20 09:49:51 BST 2023
I've just looked into the background of this a bit more in the
gigantic 2004 book The Unicode Standard 4.0. Comma-below and
cedilla-below occur in Turkish and in Romanian. They are different
glyphs for the same thing, and which is used in printing is often just
a characteristic of the chosen font, but as the authors explain,
"cedilla is preferred for Turkish and comma for Romanian". (They do
not tell us exactly by whom they are preferred!)
But each of the legacy 8-bit codings contained only one form:
ISO/IEC 8859-2 has the cedilla form
ISO/IEC 8859-16 has the comma form
So if you map these to Unicode in a standard way, Romanian goes to
cedilla-form or comma-form depending on which 8-bit code was used to
represent it. As the authors say, "Migrating Romanian 8-bit data to
Unicode should be done with care" (page 196). The original Snowball
stemmer must have been developed with Romanian text that started life
in ISO/IEC 8859-2.
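
To make the mapping concrete, here is a small sketch, assuming Python 3 and its standard iso8859_2 and iso8859_16 codecs. The four byte values are the positions where both code pages place these letters; the helper name describe is purely illustrative.

```python
import unicodedata

# Bytes 0xAA, 0xBA, 0xDE, 0xFE are the S, s, T, t letters with a mark
# below in both ISO/IEC 8859-2 and ISO/IEC 8859-16.
RAW = bytes([0xAA, 0xBA, 0xDE, 0xFE])

def describe(raw, encoding):
    """Decode raw bytes and list (code point, Unicode name) per character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch))
            for ch in raw.decode(encoding)]

from_8859_2 = describe(RAW, "iso8859_2")    # cedilla forms
from_8859_16 = describe(RAW, "iso8859_16")  # comma-below forms

for row in from_8859_2 + from_8859_16:
    print(*row)
```

The same four bytes come out as U+015E, U+015F, U+0162, U+0163 (the cedilla letters) under 8859-2, and as U+0218, U+0219, U+021A, U+021B (the comma-below letters) under 8859-16.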
So the cedilla form in Romanian is not incorrect, exactly, it is just
not the preferred form.
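
If one wanted a stemmer to treat both forms alike, one option would be to fold the cedilla letters into the comma letters before stemming. The following is only a sketch of that idea, not a description of what the Snowball stemmer does; the names CEDILLA_TO_COMMA and to_comma_form are hypothetical.

```python
# Map the four cedilla letters to their comma-below counterparts.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
    "\u015F": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u0163": "\u021B",  # t with cedilla -> t with comma below
})

def to_comma_form(text):
    """Return text with cedilla s/t replaced by comma-below s/t."""
    return text.translate(CEDILLA_TO_COMMA)

# "stiinta" written with cedilla letters, as 8859-2 data would decode.
word = "\u015Ftiin\u0163\u0103"
print(to_comma_form(word))
```

This is only safe on text already known to be Romanian, since Turkish text should keep the cedilla form.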