[Xapian-discuss] Question on how to handle some bad result from stem algorithm?

Olly Betts olly at survex.com
Wed Oct 5 15:27:31 BST 2011


It's confusing to start an unrelated discussion by replying to an
existing thread - better to send a new email.

On Fri, Sep 30, 2011 at 04:25:13PM +0800, Bruce Zhang wrote:
> When using the Stem library, it works well in most cases.
> 
> However, we also notice some bad results caused by stemming. Some examples:
> 
> Community, communication and communicator all match each other in searches,
> though we don't think they are the same.

It sounds like you're using the "porter" stemmer which conflates these
three (to "commun").  Use "english" instead, which produces "communiti"
for "community", and "communic" for the other two (which seems reasonable
as they are closely related).  The "porter" stemmer is just there for
people who really want the original version of Martin Porter's
algorithm.
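
One quick way to see which words a given stemmer conflates is to group a
word list by stem.  The following is only a sketch: the helper names
(xapian_stemmer, group_by_stem, report) are made up for illustration, and
it assumes the Xapian Python 3 bindings are installed and return stems as
bytes.

```python
from collections import defaultdict


def xapian_stemmer(language):
    """Return a str -> str stemming function backed by Xapian.

    Assumes the Xapian Python bindings are installed; the import is done
    here so the grouping helper below can be used without them.
    """
    import xapian

    stem = xapian.Stem(language)

    def apply(word):
        result = stem(word)
        # Assumption: the Python 3 bindings return bytes, so decode them.
        return result.decode("utf-8") if isinstance(result, bytes) else result

    return apply


def group_by_stem(words, stem):
    """Map each stem to the list of input words that produce it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


def report(words, languages=("porter", "english")):
    """Print the stem groups for each named stemmer (requires Xapian)."""
    for language in languages:
        print(language, group_by_stem(words, xapian_stemmer(language)))
```

Calling report(["community", "communication", "communicator"]) should then
show a single group for "porter" and two groups for "english", matching the
stems described above.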

> Anime, animal, animated can be searched by each other

These three are still conflated by the "english" stemmer.  Conflating the
first and last doesn't seem so bad ("anime" is a particular sort of
"animated" film), but conflating "animal" with them seems rather unhelpful.

We just take the algorithms from the Snowball project though, so that's
the best place to report problematic cases:

http://snowball.tartarus.org/

> What are your thoughts? Is there a good way to avoid this?
> Is there another equivalent but simpler algorithm?

The "english" algorithm is a better option than "porter".

I've heard just performing the first steps of the Porter algorithm is
pretty effective, but we don't have an implementation of that currently.
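
As a rough illustration of what such a light stemmer might look like, here
is a sketch of step 1a only (the plural-stripping step) of the original
Porter algorithm.  The function name is made up, this is not part of Xapian,
and whether "the first steps" would also include steps 1b and 1c is left
open.

```python
def porter_step_1a(word):
    """Apply step 1a of the original Porter algorithm to a lowercase word.

    Rules, tried in order (the longest matching suffix wins):
        SSES -> SS
        IES  -> I
        SS   -> SS
        S    -> (removed)
    """
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word
```

This step on its own leaves "anime", "animal" and "animated" untouched, so
they stay distinct, while plurals such as "animals" still match their
singular form.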

Cheers,
    Olly


