[Snowball-discuss] German Stemmer

Richard Boulton richard at tartarus.org
Tue Nov 3 16:06:58 GMT 2009


2009/11/3 Tobias N. Sasse <tobi at byte23.de>:
> I am a German computer science student, currently doing research in
> text analysis systems. I need stemmers for all kinds of languages (a good
> start would be English, German, French, Spanish...)
>
> I had a quick look at the German version on your site and sadly noticed
> that it produces tons of errors. For instance
>
>  "katze" -> "katz"
>  "kätzchen" -> "katzch"
>  "kätzchens" -> "katzch"
>
> is wrong: there is no German word "katzch"; it should be "katze" (the actual
> stem). "katz" is also wrong; the trailing "e" is missing...

These are not errors.  The stemming algorithm is not meant to return
correct words - all it is intended to do is produce the same result
for words with a closely related meaning, and a different result for
words with a different meaning.  It doesn't always do this correctly.
In this case, though, as I understand it, the word "katze" corresponds
to the English word "cat", while "kätzchen" and "kätzchens" correspond
to the English word "kitten".  It therefore seems correct to me that
the latter two words should return the same stem, but the former
should return a different stem.
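
To make that concrete, here is a minimal sketch of how stems are typically
used: as grouping keys, not as words to show to anyone.  It does not call
the Snowball stemmer at all; the stems are simply hard-coded from the
examples quoted above, and the names REPORTED_STEMS and group_by_stem are
made up for this illustration.

```python
from collections import defaultdict

# Stems as reported earlier in this thread for the German stemmer.
# Hard-coded (an assumption of this sketch) so that it runs without
# any stemmer library installed.
REPORTED_STEMS = {
    "katze": "katz",
    "kätzchen": "katzch",
    "kätzchens": "katzch",
}


def group_by_stem(words, stem_of):
    """Group words into buckets keyed by their stem (hypothetical helper)."""
    groups = defaultdict(list)
    for word in words:
        groups[stem_of[word]].append(word)
    return dict(groups)


groups = group_by_stem(["katze", "kätzchen", "kätzchens"], REPORTED_STEMS)
print(groups)
# {'katz': ['katze'], 'katzch': ['kätzchen', 'kätzchens']}
```

The two "kitten" forms land in one bucket and "katze" in another, which is
all a search index needs; it makes no difference that "katzch" is not
itself a German word.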

> So my question is: do you know of an improved version, or an alternative
> algorithm? What about the other languages, and how good is their quality?
> I am not a linguist, so I can't judge it myself....

Apart from the slight variant on the German stemmer also available on
the Snowball website
(http://snowball.tartarus.org/algorithms/german2/stemmer.html), I
don't know of any other German stemmers.  Others on this list may do,
though.

-- 
Richard


