[Snowball-discuss] Re: Porter Stemmer Question

Martin Porter martin_porter@SoftHome.net
Fri Apr 11 12:03:01 2003


Jason,

Thanks for your email. I think you will need to adapt my work for your
needs, but you might find the following useful:



The Porter stemmer behaves less well than the Porter2 stemmer here, but to recover
the suffix in the Porter stemmer, do the following:

If W stems to S, put a mark in W after the longest common prefix of S and W,
e.g.

    W = rat|ing
    S = rate

then -

advance the mark if it precedes 'y' (alimon|y -> alimony|)
advance the mark if it separates a double consonant (cut|ting -> cutt|ing)
other than double l before y (thoughtful|ly is left alone).
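
A minimal Java sketch of the marking procedure above, for illustration only.
It assumes the stem S has already been computed elsewhere and is passed in as
a plain string. The class and method names (SuffixMark, markPosition, suffix,
marked) are hypothetical and are not part of the Stemmer class. Two points
are interpretations rather than anything stated above: a "consonant" is taken
to be any letter other than a, e, i, o, u, and "double l before y" is taken
to mean that the letter right after the second l is y.

```java
final class SuffixMark {
    private SuffixMark() {}

    /** Index in word at which the recovered suffix starts (the position of the mark). */
    static int markPosition(String word, String stem) {
        // Place the mark after the longest common prefix of word and stem.
        int limit = Math.min(word.length(), stem.length());
        int mark = 0;
        while (mark < limit && word.charAt(mark) == stem.charAt(mark)) {
            mark++;
        }
        if (mark < word.length() && word.charAt(mark) == 'y') {
            // The mark precedes 'y': advance it (alimon|y -> alimony|).
            mark++;
        } else if (mark > 0 && mark < word.length()
                && word.charAt(mark - 1) == word.charAt(mark)
                && isConsonant(word.charAt(mark))) {
            // The mark separates a double consonant: advance it
            // (cut|ting -> cutt|ing), unless it is a double l followed by y
            // (thoughtful|ly is left alone).
            boolean doubleLBeforeY = word.charAt(mark) == 'l'
                    && mark + 1 < word.length()
                    && word.charAt(mark + 1) == 'y';
            if (!doubleLBeforeY) {
                mark++;
            }
        }
        return mark;
    }

    /** Assumption: every letter other than a, e, i, o, u counts as a consonant. */
    private static boolean isConsonant(char c) {
        return Character.isLetter(c) && "aeiou".indexOf(c) < 0;
    }

    /** The ending that follows the mark; may be empty. */
    static String suffix(String word, String stem) {
        return word.substring(markPosition(word, stem));
    }

    /** The word with the mark shown as '|'. */
    static String marked(String word, String stem) {
        int mark = markPosition(word, stem);
        return word.substring(0, mark) + "|" + word.substring(mark);
    }
}
```

With this sketch, SuffixMark.marked("cutting", "cut") gives cutt|ing, and
SuffixMark.suffix("meetings", "meet") gives ings.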

With my sample vocabulary, you then get these endings:

ability
able
ableness
ables
ably
al
alities
ality
alize
alized
ally
alness
als
ance
ances
ancy
ant
anted
ants
ate
ated
ately
ates
ating
ation
ational
ations
ative
atively
atives
ator
ators
d    (agree|d)
ds   (algorithm weakness)
e
eable
eably  (interchang|eably)
eal    (hymen|eal)
ealed
ed
eds
eer    (mountain|eer)
eered
eering
eers
eful
eing
ely
ement
ements
ence
ences
enci
encies
ency
eness
enesses
ent
entative
entatives
ented
enting
ently
ents
eous
eously
eousness
er
ered
ering
erings
ers
es
ess
esses
ful
fulness
fuls
ibilities
ibility
ible
ibles
ic
ical
ically
icals
icate
icated
icates
icating
ication
ications
icative
iced
icing
icities
icity
ics
ies
ilities
ility
ing
ings
ion
ional
ioned
ions
ism
isms
itied
ities
ity
ive
ively
iveness
ives
ization
ize
ized
izer
izes
izing
ly
ment
ments
ness
nessed
nesses
nessing
or
ors
ou     (algorithm weakness)
ous
ously
ousness
s

A problem is that final -e is restored on one-syllable words only. This is
okay for IR work, but linguistically inconsistent. Bob Krovetz adapted the
algorithm in more or less the way you want for dictionary lookup, but
the fruits of his labours are not available on the web anywhere.

Martin


P.S. I'm sure you won't mind me posting this on
snowball-discuss@lists.tartarus.org



At 15:45 10/04/2003 +0100, Jason Dutta wrote:
>Hello there, 
>
>I am currently a student at the University of Exeter, England, and am doing
>a third year project for my Computer Science degree. I'm trying to create an
>intelligent spelling checker using sentence parsing, dictionaries and your
>stemmer to create more valid suggestions to misspelled words.
>
>One question about your stemmer, it works perfectly, as it should, but is
>there any way to find out which suffixes had been stemmed?
>
>An example:
>
>Word entered: meetings
>Stemmed word: meet
>
>What I'm trying to get is the 'ings' part of the stem. I know it is not as
>simple as it seems, since some words change a letter or two instead of just
>removing them. I'm going to try to write a Java method (my project is being
>made in Java) in your Stemmer class to capture this stem and then to be
>called from another class.
>
>I hope this makes sense! Any help you bring may improve my project tenfold.
>Thanks in advance.
>