[Snowball-discuss] Porter stemmer algorithm (Java implementation)

Martin Porter martin.porter at grapeshot.co.uk
Tue May 6 09:27:14 BST 2008



Dane,

(I've seen your two emails -- that's okay.)

The stemmings you report are clearly not to be desired, but of course
the problem is that the algorithm does not know the language, just
various rules. gases->gase is analogous to vases->vase, and gasses->gass
is analogous to mosses->moss. gas->ga is improved on in the English
(Porter2) stemmer at snowball.tartarus.org, and you get gas->gas in
that case.
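
To illustrate the analogies above, here is a minimal Python sketch of the
plural-stripping rules of step 1a in the original Porter algorithm
(SSES -> SS, IES -> I, SS -> SS, S -> nothing). It covers only that one step,
not the whole stemmer, and the function name step_1a is chosen here purely
for illustration.

```python
def step_1a(word):
    """Apply only the step 1a plural rules of the original Porter algorithm."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS   (mosses -> moss, gasses -> gass)
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word        # SS   -> SS   (moss -> moss)
    if word.endswith("s"):
        return word[:-1]   # S    -> ""   (vases -> vase, gas -> ga)
    return word


for w in ["gas", "gases", "gasses", "vases", "mosses"]:
    print(w, "->", step_1a(w))
```

The rules look only at the suffix, so "gas" loses its final "s" in exactly
the same way that "vases" does.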

The Porter2 stemmer has exception sets. 'gas' might be placed among
them. What I've tended to do is to include exceptions when people report
errors and confusions that arise in real searches, rather than
oddities discovered while browsing.
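
As a rough sketch of how an exception set can sit in front of the rules: the
table below is hypothetical (it is not the actual Porter2 exception list),
and rule_based_stem is a deliberately tiny stand-in for the real rule steps.

```python
# Hypothetical exception table, for illustration only.
EXCEPTIONS = {
    "gas": "gas",
    "gases": "gas",
    "gasses": "gas",
}


def rule_based_stem(word):
    """Stand-in for the real rules: drop one final 's' unless the word ends in 'ss'."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def stem(word):
    """Consult the exception table first, then fall back to the rules."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return rule_based_stem(word)


for w in ["gas", "gases", "gasses", "vases"]:
    print(w, "->", stem(w))
```

Words in the table bypass the rules entirely, so adding an entry changes the
result for that word only and leaves every other stemming untouched.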

Martin

(I've copied this to snowball-discuss)

On Mon, 2008-05-05 at 15:19 -0500, Dane Wyrick wrote:
> Hello,
> 
> I've been using the Java implementation of your Porter stemmer
> algorithm.  I was playing around with some chemical names to test the
> robustness of this algorithm in conjunction with the "Double
> Metaphone" algorithm to ignore spelling mistakes from user input.  I
> noticed that there is some interesting output from using the simple
> word "gas".
> 
> word : stem
> 
> gas : ga
> gases : gase
> gasses : gass
> 
> Is the behavior from the implementation actually desired?  I would
> think that all three should return "gas" as the stem.
> 
. . . . 
> 
> -- 
> Dane Wyrick




