[Snowball-discuss] Stemming with german2 on hyphenated compound words

Tue Sep 24 23:55:19 BST 2024

On Tue, Sep 24, 2024 at 10:23:23AM +0200, Simon wrote:
> As an example I have 2 words: "Export-Schnittstelle" and "Schnittstelle",
> for these words the stemmer creates "Export-Schnittstell" or
> "Schnittstell" respectively, which is great because with the right
> tokenization I can now search for "Schnittstelle" (which the stemmer within
> my search analyzer will transform to "Schnittstell") and it will match the
> second part from the word "Export-Schnittstelle" aka
> "Export-Schnittstell". 
>
> Now I would expect that this is how it works for all hyphenated compound
> words. But unfortunately that's not the case. So I now have 2 other words
> "PA-Schiene" and "Schiene". Here the stemmer creates two completely
> different words: "PA-Schi" and "Schien".
>
> Can someone explain to me why this is and if there is a way to fix this?

It's essentially because the algorithm won't remove an ending if the
stem that this would leave is too short.  In this case removing `-ene`
would leave `schi` (too short) for `schiene` vs `pa-schi` (OK) for
`pa-schiene`.  For the former it removes just `-e` instead, since
`schien` is long enough.
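
To make "too short" concrete, here's a rough sketch of how the region
the German stemmer is allowed to remove endings from (called R1 in the
Snowball algorithm descriptions) can be computed.  This is a simplified
illustration rather than the actual Snowball code: the function name
`r1_start` is made up for this example, and it skips the prelude step
that treats `u` and `y` between vowels as consonants (which doesn't
matter for these two words).

```python
# Simplified sketch of the R1 region used by the Snowball German stemmer.
# Assumption: R1 starts after the first non-vowel that follows a vowel,
# and is then moved right if needed so at least 3 characters precede it.
GERMAN_VOWELS = set("aeiouyäöü")


def r1_start(word: str) -> int:
    """Return the index at which R1 begins (len(word) if R1 is empty)."""
    word = word.lower()
    for i in range(1, len(word)):
        # A hyphen is not a vowel, so it counts as a non-vowel here.
        if word[i] not in GERMAN_VOWELS and word[i - 1] in GERMAN_VOWELS:
            return min(max(i + 1, 3), len(word))
    return len(word)


for w in ("schiene", "pa-schiene"):
    start = r1_start(w)
    print(w, start, repr(w[start:]))
# schiene 6 'e'
# pa-schiene 3 'schiene'
```

So for `schiene` only the final `-e` lies inside R1, and the `-en` of
`schien` is outside it and stays.  For `pa-schiene` the hyphen ends the
first vowel/non-vowel pair early, R1 covers all of `schiene`, and both
`-e` and then `-en` can go.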

I think you probably want to split words at hyphens when tokenising,
which would avoid such problems.
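
As a minimal sketch of what splitting at hyphens could look like (not
tied to any particular search engine's analyzer): the helper names
below are hypothetical, and the stemmer is passed in as a plain
callable so the example doesn't depend on a specific binding.  With the
Python `snowballstemmer` package installed, something like
`snowballstemmer.stemmer("german").stemWord` should be usable as the
`stem` argument.

```python
import re
from typing import Callable, List


def tokenize_split_hyphens(text: str) -> List[str]:
    """Lowercase the text and split it on whitespace and hyphens."""
    # Snowball stemmers expect lowercase input, so lowercase first.
    return [tok for tok in re.split(r"[\s\-]+", text.lower()) if tok]


def stem_tokens(text: str, stem: Callable[[str], str]) -> List[str]:
    """Tokenise with hyphen splitting, then stem each token on its own."""
    return [stem(tok) for tok in tokenize_split_hyphens(text)]


print(tokenize_split_hyphens("PA-Schiene Export-Schnittstelle"))
# ['pa', 'schiene', 'export', 'schnittstelle']
```

Because `schiene` is then stemmed as a token in its own right, it gets
the same stem whether it came from "PA-Schiene" or from "Schiene".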

Cheers,
    Olly