[Snowball-discuss] Possible bug in Porter Stemmer

Olly Betts olly at survex.com
Sun Nov 9 19:38:11 GMT 2014


On Tue, Oct 21, 2014 at 12:59:16PM +0000, Marcel Daneck wrote:
> I might have found a bug in the Porter stemmer for English (http://snowball.tartarus.org/algorithms/porter/stemmer.html).
> In the example list of words (http://snowball.tartarus.org/algorithms/porter/diffs.txt) the word "agreement" stays "agreement" after stemming.
> 
> But step 4 says that if R2 ends with "ent" the "ent" should be deleted
> (and the Snowball code does so).
> The region 2 for "agreement" is "ent", so it should be deleted and the
> resulting stem should be "agreem".
> 
> The rules for "ment" and "ement" could match too. But the stem left
> after removing "ment" would be "agree", which has m=1 and not m>1.
> (The same goes for "ement".)
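
For anyone who wants to check the numbers in the report, here is a minimal
Python sketch of the R1/R2 regions and of Porter's measure m. The function
names are mine, not taken from the Snowball sources, and 'y' is never treated
as a vowel, which is a simplification that does not matter for "agreement".

```python
VOWELS = set("aeiou")  # assumption: 'y' is ignored in this sketch


def region_start(word: str, start: int = 0) -> int:
    """Index just past the first non-vowel that follows a vowel,
    scanning from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word: str) -> tuple[str, str]:
    """Return the R1 and R2 regions of `word` as strings."""
    r1 = region_start(word)
    r2 = region_start(word, r1)  # same rule, applied inside R1
    return word[r1:], word[r2:]


def measure(stem: str) -> int:
    """Porter's m: the number of vowel-to-consonant transitions in `stem`."""
    return sum(
        1
        for i in range(1, len(stem))
        if stem[i - 1] in VOWELS and stem[i] not in VOWELS
    )


print(r1_r2("agreement"))
print(measure("agree"), measure("agre"))
```

This gives R1 = "reement" and R2 = "ent", and m = 1 for both "agree" and
"agre", so the figures in the report check out.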

I think this old thread has the answer:

http://thread.gmane.org/gmane.comp.search.snowball/39

It looks like the text for the "porter2" aka "english" stemmer was
updated:

http://snowball.tartarus.org/algorithms/english/stemmer.html

But the "porter" description is still as before.
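
My understanding of the discrepancy (an assumption on my part, not a quote
from that thread or from the Snowball sources) is that step 4 first selects
the longest suffix the word ends with and only then tests the R2 condition,
rather than trying each suffix in turn. A rough Python sketch of both
readings, limited to the three suffixes in question, with hypothetical
function names and 'y' ignored:

```python
VOWELS = set("aeiou")  # assumption: 'y' is ignored in this sketch
SUFFIXES = ("ement", "ment", "ent")  # only the suffixes discussed here, longest first


def r2_start(word: str) -> int:
    """Start index of R2: apply the 'non-vowel after a vowel' rule twice."""
    pos = 0
    for _ in range(2):
        nxt = len(word)
        for i in range(pos + 1, len(word)):
            if word[i - 1] in VOWELS and word[i] not in VOWELS:
                nxt = i + 1
                break
        pos = nxt
    return pos


def step4_longest_match(word: str) -> str:
    """Pick the longest matching suffix, then delete it only if it lies in R2."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            cut = len(word) - len(suffix)
            return word[:cut] if cut >= r2_start(word) else word
    return word


def step4_any_match(word: str) -> str:
    """The reading in the report: delete any listed suffix that lies in R2."""
    for suffix in SUFFIXES:
        cut = len(word) - len(suffix)
        if word.endswith(suffix) and cut >= r2_start(word):
            return word[:cut]
    return word


print(step4_longest_match("agreement"))
print(step4_any_match("agreement"))
```

With the longest-match reading, "ement" is selected, it does not fit inside
R2 ("ent"), and the word is left as "agreement". With the other reading,
"ent" is tried on its own and the result is "agreem".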

You probably want to be using porter2 rather than porter, unless you're
deliberately trying to match results obtained with the original
algorithm.

Cheers,
    Olly


