[Snowball-discuss] Quasi-infinite recursion in Turkish stemmer

Mon Aug 29 05:43:23 BST 2022

On Wed, Aug 24, 2022 at 10:31:26AM -0400, Tom Lane wrote:
> Some folks doing static code analysis on Postgres [1] discovered that
> Snowball's Turkish stemmer contains unchecked recursion: the function
> r_stem_suffix_chain_before_ki() will recurse to self indefinitely.
> It's possible to exploit that to drive the Postgres server to a stack
> overflow crash:
> 
> postgres=# SELECT ts_lexize('turkish_stem', repeat('lerdeki', 1000000));
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> 
> We're not too happy about that, as it's on the edge of being something
> we'd consider a security issue.  Ordinarily we'd just stick in a stack
> depth check and call it good, but since this is generated code that
> doesn't seem like a workable long-term fix.  Besides, other users of
> the stemmer might also be concerned about this.
> 
> I have zero competence in Snowball, so I have no ability to write
> an algorithmic fix; but I wonder if this recursion could be converted
> to iteration, or else bounded somehow.  Surely real-world cases
> wouldn't need to recurse more than a few levels.

Looking at the snowball code, it looks to me like
stem_suffix_chain_before_ki will always shorten the string before
calling itself, so the depth of recursion should be limited by the
length of the input string.

The input string for a snowball stemmer should be a word from a
human language, so as a simple way to avoid deep recursion here I'd
suggest imposing a maximum length on the input you pass to Snowball
(if you just leave anything longer alone then to the caller it's as if
the stemmer didn't change it).

For Turkish it seems you could safely skip anything longer than 117
characters (https://en.wikipedia.org/wiki/Longest_word_in_Turkish)
but even that seems an obscure and contrived example, so a lower limit
would probably be fine too.  Maybe someone reading who's familiar with
Turkish could provide some insight here?

I'm not sure what a good cross-language threshold would be, but FWIW
Xapian's indexing code ignores any "word" of more than 64 bytes of
UTF-8.  I think that was partly chosen to avoid wasting time and disk
space indexing binary data encoded as text (things like base64 and
uuencode).  Anyway nobody's ever complained about that being too low.

As to how we could address this in Snowball, I think the easy option
would be to bail out early based on input length in turkish.sbl.
Refactoring stem_suffix_chain_before_ki to iterate instead of recurse
looks complicated.  Modifying the Snowball compiler to turn
self-recursion into iteration also looks complicated.

I'll respond to Martin's wider points about this stemmer separately.

Cheers,
    Olly