[Snowball-discuss] Polish stemmer?

Dawid Weiss dawid.weiss at cs.put.poznan.pl
Thu Aug 30 08:25:31 BST 2007


> Hi everyone, thank you for your replies! The way I would like to use the
> stemmer is as an additional tool along with an inflection dictionary, to get
> base forms of words unknown in the dictionary.

Note that stemming isn't meant to accomplish this task perfectly. I like the 
distinction between lemmatization and stemming: lemmatization yields an accurate 
base form (a lemma), while stemming yields a distinct token denoting a concept 
(not necessarily a lemma, but unique).

> enough to reduce the problem of multiple forms of unknown words in the
> collection index. I noticed the Stempelator stemmer has problems with such
> words, so I wonder whether a simpler suffix stripper wouldn't suffice.

That's basically what Stempelator (and in fact Stempel) does: it is a trained, 
decision-tree-like structure for suffix stripping. You may want to read Andrzej 
Bialecki's description of Stempel and Leo Galambos' PhD thesis, where the 
algorithm is described in detail. I agree it doesn't work very effectively, 
which only makes the problem more interesting.
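
For illustration only, here is a minimal Python sketch of the setup discussed 
above: look a word up in an inflection dictionary first, and fall back to plain 
suffix stripping when the word is unknown. The dictionary entries, the suffix 
list, the minimum stem length and the function names are all made up for this 
example; this is not how Stempel or Stempelator work internally, since those 
learn their stripping rules from training data.

```python
# Hypothetical toy data: a tiny inflection dictionary (form -> lemma) and a
# hand-picked suffix list. Neither comes from Stempel or any real resource.
LEMMAS = {"domu": "dom", "domami": "dom", "kotem": "kot"}
SUFFIXES = ["ami", "ach", "owi", "em", "om", "u", "y", "a"]
MIN_STEM = 3  # assumed threshold: never strip a word below this length


def strip_suffix(word):
    """Remove the longest listed suffix that leaves at least MIN_STEM characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def index_token(word):
    """Return the dictionary lemma if the form is known, otherwise a stripped stem."""
    word = word.lower()
    return LEMMAS.get(word) or strip_suffix(word)
```

With this toy data, the unknown forms "serwerami" and "serwerem" both map to 
the token "serwer". That is enough to merge them in an index, but nothing 
guarantees the token is a real lemma, which is the stemming vs. lemmatization 
distinction made earlier.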

Dawid


