[Snowball-discuss] German Stemmer

Richard Boulton richard at tartarus.org
Tue Nov 3 17:24:04 GMT 2009


2009/11/3 Tobias N. Sasse <tobi at byte23.de>:
> thanks for your input. I am not very familiar with these algorithms - from my
> understanding, please correct me if I am wrong, a stemming algorithm reduces
> words to a common stem, which is not necessarily a correct word in the
> language itself. Which is ok for my use-case, as long as not too many words
> with different meanings refer to the same stem.

That's the idea.

> That sounds fine for me, I am curious - how many nouns are reduced?

The stemmer is rule based, so as many nouns as you give it are
reduced.  (I may be misunderstanding the question, here.)
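
As a rough illustration of what "rule based" means here, the sketch
below applies an ordered list of suffix rules to whatever word it is
given, with no dictionary lookup.  The rules, the minimum stem length
and the name toy_stem are invented for this example; they are not the
rules of the Snowball German or English stemmers, which are
considerably more careful.

```python
# Toy illustration only: these rules are invented for the example and are
# NOT the rules used by any Snowball stemmer.
SUFFIX_RULES = [
    ("ingly", ""),
    ("ing", ""),
    ("ies", "y"),
    ("es", ""),
    ("s", ""),
]

# Refuse to strip a suffix if the remaining stem would be shorter than this.
MIN_STEM_LENGTH = 3


def toy_stem(word):
    """Apply the first suffix rule that matches and leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    # No rule applied: the word is returned unchanged.
    return word


if __name__ == "__main__":
    for w in ["walking", "ponies", "cars", "bus", "carport"]:
        print(w, "->", toy_stem(w))
```

Because the rules only look at the end of the word, any input gets
processed, whether or not it is a noun or even a real word.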

> I don't
> want "carport" to be reduced to "car" as this could be a problem in my
> scenario. I know this is a difficult task, as it requires a lot of knowledge
> on the particular language and grammar...

That particular example is fine with the English stemmer (I don't know
the German equivalents, but I imagine it would be fine there, too).
The stemming is usually fairly conservative.
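
One way to check this for your own vocabulary is to group words by
their stem and look at which distinct words end up together.  The
sketch below takes the stemmer as a parameter; strip_plural_s is a
hypothetical stand-in that only strips a trailing "s", used so the
example runs on its own.  The function names are made up for the
example, and you would pass in whichever real stemming function you
actually use.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Map each stem to the sorted list of distinct words that produce it."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return {s: sorted(ws) for s, ws in groups.items()}


def conflated(words, stem):
    """Keep only the stems that are shared by more than one distinct word."""
    return {s: ws for s, ws in group_by_stem(words, stem).items() if len(ws) > 1}


def strip_plural_s(word):
    """Hypothetical stand-in stemmer: strips one trailing "s", nothing else."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


if __name__ == "__main__":
    vocabulary = ["car", "cars", "carport", "carports", "garage"]
    for s, ws in conflated(vocabulary, strip_plural_s).items():
        print(s, "<-", ws)
```

Reading through the groups for a sample of your own data shows quickly
whether words you want kept apart are being merged.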

> Further I'd like to know if there is data I can exploit for research: I am
> looking for stopword lists, synonym tables etc, I have been looking around
> for a while now but never found something useful... Most stopword lists only
> contain a few dozen words :-/

There are some suggested stopword lists for each language available
from the snowball website: e.g., see
http://snowball.tartarus.org/algorithms/german/stemmer.html

Generally, I don't use stopwords for the work I do with search
engines, and precalculated lists of stopwords are often of little use:
you tend to need custom ones to match your dataset.  However, the
snowball ones may be of some help to you, anyway.
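
If you do want a custom list, one simple approach (a sketch of a
common heuristic, not something snowball provides) is to treat terms
that occur in most of your documents as stopword candidates.  The
function name custom_stopwords, the 0.8 threshold and the crude
tokeniser below are all assumptions made for the example; you would
want to tune them and review the output by hand.

```python
import re
from collections import Counter

# Crude tokeniser for the example: runs of word characters, lowercased.
TOKEN = re.compile(r"\w+")


def custom_stopwords(documents, min_doc_fraction=0.8):
    """Return terms that appear in at least min_doc_fraction of the documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(TOKEN.findall(text.lower())))
    threshold = min_doc_fraction * len(documents)
    return sorted(term for term, n in doc_freq.items() if n >= threshold)


if __name__ == "__main__":
    docs = [
        "der Hund bellt",
        "der Baum ist grün",
        "die Katze und der Hund",
    ]
    print(custom_stopwords(docs))
```

The candidates this produces depend entirely on the collection, which
is the reason a precalculated list often fits poorly.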

Snowball doesn't have synonym tables: I'd suggest looking at WordNet for those.

There are some example vocabularies for each language (used for
testing of snowball).  These are also linked to from the pages on the
snowball website which describe each stemmer.

-- 
Richard


