[Snowball-discuss] Stemming Romanian with s&t with comma (vs cedilla)

Mon Aug 7 04:42:32 BST 2023

On Thu, Mar 09, 2023 at 12:13:57AM +0000, Martin Porter wrote:
> The question is, do we map cedilla to comma-under, or comma-under to
> cedilla? Since comma-under is becoming standard, it would seem sensible
> to map cedilla to comma-under, but even so, I would suggest doing it
> the other way round, so that users who have had to rely on the cedilla
> representation will not notice any change. The stemmed forms will have
> the cedilla form of s&t, but these should be hidden from view anyway.
> 
> Olly Betts manages GitHub Snowball, and we must see what he thinks.

Martin's suggestion would seem to preserve compatibility better for
existing users since the only words that would change stem would be
those containing the comma-under letters.  As a general point that
seems a laudable aim to me.

However, in practice it seems any non-historic dataset of Romanian text
will contain a significant proportion (and likely a majority) of the
comma-under form, so whichever form we normalise to is going to change
the stemmed form for enough words to warrant a reindex anyway.  To give
an idea of impact, 19% of romanian/voc.txt entries and 14% of
romanian/output.txt entries contain ş or ţ.  This is not weighted
by frequency, but the vocabulary should at least be a selection of
fairly common words.
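
For anyone wanting to gauge the impact on their own data, here is a
rough Python sketch of that kind of count.  It is not how the figures
above were produced; the function name is made up for illustration,
and it assumes one entry per line and that only the lowercase cedilla
code points U+015F and U+0163 are of interest.

```python
# Lowercase s and t with cedilla (the older representation).
CEDILLA_LETTERS = "\u015f\u0163"


def fraction_with_cedilla(entries):
    """Return the fraction of non-empty entries containing a cedilla letter.

    `entries` is any iterable of strings, e.g. an open text file with one
    word per line.  The count is per entry, not weighted by frequency.
    """
    words = [w for w in (entry.strip() for entry in entries) if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if any(c in CEDILLA_LETTERS for c in w))
    return hits / len(words)
```

Passing a file opened with encoding="utf-8" should work, since
iterating over a text file yields its lines.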

There are downsides to baking the old cedilla form into the stem as a
quirk of the algorithm forever.  Really the stem is best thought of as
an opaque token (that often just happens to look a lot like a word in
the language), but people inevitably look at stems and see them as words,
and sticking with cedillas would be likely to confuse some users and
lead them to waste time looking for where their data uses the old
way of representing these accents.

We don't really have a prior case that's analogous to this - previous
changes to the algorithms have either been minor tweaks that only
affected a sufficiently small number of cases that you could get away
without having to reindex, or have been treated as a new stemmer (e.g.
Martin's revised "english" stemmer vs the original "porter").

The main reason to keep "porter" was that it implements the algorithm
as documented by a widely cited paper (or very close to it), which is
not the case for the Romanian algorithm, so I don't think it's helpful
to create "romanian2", or for that matter to preserve the existing
algorithm as-is as "romanian-old".  The only use-case for using the
existing algorithm as-is seems to be if you're using an existing system
which is stuck using ISO-8859-2 and so have to use cedillas for these
accents.  I'm dubious such systems still exist; if someone is stuck
maintaining one, they can stick with an older version of Snowball.

So I'm going to merge the submitted patch which standardises on the
comma-under accent, but I'd like to invite feedback from people actually
using the Romanian stemmer as to how they'd prefer us to handle this -
we can revise this so long as we do so before we make a release (I'd
like to get one out soon, but I should work through the backlog of
issues and patches so realistically it'll probably be a few months).
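
To make the direction of the mapping concrete, here is a minimal
Python sketch of normalising cedilla forms to comma-under forms in
application code.  This is not the code from the submitted patch
(which changes the Snowball algorithm itself); the names are
hypothetical, and the uppercase letters are included only on the
assumption that the text may not have been lowercased yet.

```python
# Map s/t with cedilla to s/t with comma below.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
})


def normalise_romanian(text):
    """Replace the cedilla forms with the comma-under forms.

    All other characters are left untouched, so applying this twice
    gives the same result as applying it once.
    """
    return text.translate(CEDILLA_TO_COMMA)
```

Applying the same normalisation to both documents and queries is what
keeps the two representations matching each other.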

Cheers,
    Olly