[Snowball-discuss] Polish stemmer?
Dawid Weiss
dawid.weiss at cs.put.poznan.pl
Thu Aug 30 08:25:31 BST 2007
> Hi everyone, thank you for your replies! The way I would like to use the
> stemmer is as an additional tool along with an inflection dictionary, to get
> base forms of words unknown in the dictionary.
Note that stemming isn't meant to accomplish this task perfectly. I like the
distinction between lemmatization and stemming: a lemmatizer returns the
accurate base form (the lemma), while a stemmer returns a distinct token
denoting a concept (not necessarily a lemma, but unique).
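
To make the distinction concrete, here is a minimal sketch of a plain suffix
stripper. The suffix list, the min_stem threshold and the function name
strip_suffix are assumptions made up for this illustration; they are not taken
from Snowball, Stempel or any other existing stemmer.

```python
# Hypothetical, hand-picked list of Polish noun endings (illustration only).
SUFFIXES = sorted(
    ["ami", "ach", "em", "om", "ów", "a", "ą", "ę", "i", "y"],
    key=len,
    reverse=True,  # try longer suffixes before shorter ones
)


def strip_suffix(word, min_stem=3):
    """Remove the longest known suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# All inflected forms collapse to one token, which is not the lemma "książka".
forms = ["książka", "książki", "książką"]
stems = {strip_suffix(form) for form in forms}
print(stems)  # {'książk'}
```

The resulting token "książk" is not a valid word, but it is shared by all three
forms, which is what matters for reducing multiple forms in an index.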
> enough to reduce the problem of multiple forms of unknown words in the
> collection index. I noticed the Stempelator stemmer has problems with such
> words, so I wonder whether a simpler suffix stripper wouldn't suffice.
That's basically what Stempelator (and in fact Stempel) does -- it is a trained,
decision-tree-like structure for suffix stripping. You may want to read Andrzej Bialecki's
description of Stempel and Leo Galambos' PhD thesis where the algorithm is
described in detail. I agree it doesn't work very effectively, which only makes
the problem more interesting.
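
For a rough idea of what "trained" suffix stripping can mean, here is a toy
sketch that learns suffix rewrite rules from (inflected form, lemma) pairs and
applies the longest matching rule to unknown words. This is not Stempel's
algorithm, which is described in the references above. The flat lookup table,
the function names learn_rules and apply_rules, and the training pairs are all
assumptions made up for this example.

```python
import os
from collections import Counter, defaultdict


def learn_rules(pairs):
    """Learn 'form suffix -> lemma suffix' rules from (form, lemma) pairs."""
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        # Everything after the shared prefix is treated as the ending.
        prefix_len = len(os.path.commonprefix([form, lemma]))
        counts[form[prefix_len:]][lemma[prefix_len:]] += 1
    # Keep the most frequent replacement per suffix; skip the empty suffix.
    return {s: c.most_common(1)[0][0] for s, c in counts.items() if s}


def apply_rules(word, rules):
    """Rewrite the longest matching suffix; return the word unchanged otherwise."""
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + rules[suffix]
    return word


# Tiny made-up training set, e.g. drawn from an inflection dictionary.
training = [("kotami", "kot"), ("domem", "dom"), ("książki", "książka")]
rules = learn_rules(training)

print(apply_rules("stołami", rules))  # stoł
print(apply_rules("lampki", rules))   # lampka
```

With so little training data the rules overgeneralize quickly (every word
ending in "i" gets an "a"), which hints at why unknown words are hard for a
trained stemmer as well.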
Dawid