[Snowball-discuss] Quasi-infinite recursion in Turkish stemmer

Tue Aug 8 04:00:01 BST 2023

[Replying to a fairly old mail, sorry for being slow]

On Thu, Aug 25, 2022 at 11:29:25AM +0100, Martin Porter wrote:
> It may be useful to look at this problem in a bit more in the
> historical context of how the various stemmers came about. In snowball
> there were the internally developed stemmers (actually developed by
> me), written and tested according to the standards suggested on the
> snowball site. In addition there were also contributed stemmers, put
> on the site out of interest to others, which we were not in a position
> to evaluate through lack of knowledge of the languages they covered.
> We expected queries about and maintenance of this secondary group to
> be directed to the originators of the work, but inevitably links to
> the originators decayed over time.

This internal/contributed distinction is perhaps less relevant now
you've retired from Snowball development.  Some of the contributed
stemmers are much less well documented though, which makes maintaining
them harder.  I'm requiring documentation for newly contributed stemmers
but some contributors don't seem keen to write it.

> Turkish was a contributed stemmer, and one about which I had
> misgivings, never resolved. If you look at the 2-column result of a
> stemmer, for example
> 
> http://snowball.tartarus.org/algorithms/french/diffs.txt
> 
> (nowadays, you may need to do firefox's "view->repair text encoding"
> or equivalent) you see the first column with the usual ragged right
> edge and the second column where the words repeat in neat blocks. This
> is what you want, but the Turkish stemmer did not seem to do it. And
> also, the stemmer seemed to be very long.

It seems Turkish has a lot of suffixes so that length may be warranted
(at least I'm not surprised that it's longer than most, but maybe it's
still longer than necessary).

> I hope we can get Olly Betts' opinion on all this.

Looking at some examples, quite often the forms of a particular word get
mapped to two different stems (I think this is at least part of why you
don't see your "neat blocks" in the two-column output).

For example `odun` (firewood) taking the declension from
https://en.wiktionary.org/wiki/odun#Turkish :

odun                          odu
oduna                         odu
odunda                        odu
odundan                       odu
odunlar                       odun
odunlara                      odun
odunlarda                     odun
odunlardan                    odun
odunları                      odun
odunların                     odu
odunu                         odu
odunun                        odu

Here the singular forms all stem to "odu", while the plural forms all
stem to "odun" except for the definite genitive plural.  The much longer
list of possessive forms listed in wiktionary also all stem to either
"odu" or "odun".

(In English "firewoods" seems unnatural; I'm not sure if that's true in
Turkish or not, but that's beside the point as there are singular
possessive forms which stem to "odun".)

This "two stem" feature is a form of understemming which doesn't make
the stemmer useless but does mean it's less effective than it could be.
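
To make the split easier to see, here's a minimal Python sketch which
groups the `odun` pairs listed above by stem.  The pairs are copied from
that two-column output; the names PAIRS and group_by_stem are just made
up for illustration and aren't part of Snowball.

```python
from collections import defaultdict

# (word, stem) pairs copied from the two-column output for "odun".
PAIRS = [
    ("odun", "odu"), ("oduna", "odu"), ("odunda", "odu"),
    ("odundan", "odu"), ("odunlar", "odun"), ("odunlara", "odun"),
    ("odunlarda", "odun"), ("odunlardan", "odun"), ("odunları", "odun"),
    ("odunların", "odu"), ("odunu", "odu"), ("odunun", "odu"),
]


def group_by_stem(pairs):
    """Map each stem to the list of words which produce it."""
    groups = defaultdict(list)
    for word, stem in pairs:
        groups[stem].append(word)
    return dict(groups)


if __name__ == "__main__":
    # One line per stem: the stem, how many forms map to it, and the forms.
    for stem, words in sorted(group_by_stem(PAIRS).items()):
        print(f"{stem:6} {len(words):2}  {' '.join(words)}")
```

A stemmer which conflated perfectly would give a single group here;
instead the twelve forms fall into two.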

The other issue I've noticed is that the check to prevent it stemming
very short words is implemented as a check that the input has at least
two syllables.  This works much less well than the approach used by
Martin's stemmers (and some of the contributed ones) of having regions
(R1, R2 and/or RV) which the *removed suffix* has to be entirely in.
As a result the Turkish stemmer creates some very short stems.  For
example, all these words in turkish/voc.txt stem to just "a", but
checking them in a dictionary shows they are not all forms of the same
word ("ada" means "island"; "ata" means "ancestor"; "aya" means "palm
(of the hand)" or is the dative singular of "moon"; "aydın" means
"bright"; "ayse" means "therefore"; ...):

a                             a
ada                           a
adadır                        a
adan                          a
adaydı                        a
adayken                       a
adaymış                       a
adayım                        a
adır                          a
alara                         a
anınız                        a
ata                           a
atan                          a
atandı                        a
atayım                        a
aya                           a
aydın                         a
aydınlar                      a
ayla                          a
aysam                         a
aysan                         a
ayse                          a
aysen                         a
ayı                           a

So it's overstemming these and probably most other words with one- and
two-character stems.  Maybe retrofitting an R1/R2 approach would be
helpful here.
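
As a rough illustration of what an R1 check would do, here's a small
Python sketch.  It uses the usual Snowball definition of R1 (the region
after the first non-vowel following a vowel, or the empty region at the
end of the word if there is no such non-vowel).  The vowel set, the
function names and the suffixes tried are assumptions made for this
example; this is not how the current Turkish stemmer works, and the
suffixes aren't taken from its suffix list.

```python
# Lower-case Turkish vowels (an assumption of this sketch; input is
# assumed to be lower-cased already).
VOWELS = set("aeıioöuü")


def r1_start(word):
    """Return the index where R1 begins.

    R1 starts just after the first non-vowel which follows a vowel;
    if there is no such non-vowel, R1 is empty and starts at len(word).
    """
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def strip_suffix_in_r1(word, suffix):
    """Remove suffix only if it lies entirely inside R1.

    Hypothetical helper: returns the word unchanged when the suffix
    doesn't match or would start before R1.
    """
    cut = len(word) - len(suffix)
    if suffix and word.endswith(suffix) and cut >= r1_start(word):
        return word[:cut]
    return word


if __name__ == "__main__":
    for word, suffix in [("ada", "da"), ("ada", "a"), ("odunlar", "lar")]:
        print(word, suffix, strip_suffix_in_r1(word, suffix))
```

With a rule like this nothing before the start of R1 can be removed, so
"ada" could be reduced to "ad" at most and never to just "a".  Whether
R1, R2 or something else is the right region for Turkish is a separate
question which this sketch doesn't try to answer.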

> I did once look at it with an IR specialist whose first language was
> Turkish, but the meeting was not fruitful.

That's a shame.  If there's anyone reading with knowledge of Turkish
it would be good to have some input on this.

> Anyway, the question arises, quite apart from the stack-overflow
> issue, are you finding any benefits in using the Turkish stemmer?

AIUI Tom's a maintainer of postgres so probably isn't using the Turkish
stemmer himself.  There are probably end-users of postgres using it, but
probably none are subscribed to this list (at least personally I don't
subscribe to lists for software which is only a dependency of something
I'm directly using).

We have had issues reported about the Turkish stemmer, so there's
evidence of use.  It's hard to know how much it's helping the people
using it as they typically just highlight cases where it doesn't do
a good job, but based on what I've noticed I suspect it's generally
helping.  Consider that for "odun" it conflates 58 different forms into
just two (though admittedly this may overstate the gain as this likely
includes obscure forms which rarely if ever occur in practice).

However, it seems like it could do better for words which currently
have two stemmed forms, and the overly short stems it can produce mean
it is currently making things worse for a small subset of words.

Cheers,
    Olly