[Snowball-discuss] Small changes to English stemmer

Martin Porter martin.porter at grapeshot.co.uk
Tue Jan 10 15:02:54 GMT 2006


There have been two small changes to the English (Porter2) stemming algorithm.
The first is that the Rule

    ied ies
        replace by ie if preceded by just one letter, otherwise by i

has been changed to

    ied ies
        replace by i if preceded by more than one letter, otherwise by ie

There is a corresponding change in the Snowball script, where

            'ied' 'ies'
                   ((next atlimit <-'ie') or <-'i')

has been changed to

            'ied' 'ies'
                   ((hop 2 <-'i') or <-'ie')

This ONLY affects the two 'words' ied and ies. Formerly they stemmed to i, now
they stem to ie.
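
To make the difference concrete, here is a rough Python sketch of just this
one rule, before and after the change. It is not the Snowball code and not the
whole stemmer; the function names old_rule and new_rule are made up for
illustration, and the rest of the algorithm is ignored.

```python
SUFFIXES = ("ied", "ies")

def old_rule(word):
    """Old wording: 'ie' if preceded by just one letter, otherwise 'i'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + ("ie" if len(stem) == 1 else "i")
    return word  # rule does not apply

def new_rule(word):
    """New wording: 'i' if preceded by more than one letter, otherwise 'ie'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + ("i" if len(stem) > 1 else "ie")
    return word  # rule does not apply

if __name__ == "__main__":
    # Only the bare 'words' ied and ies come out differently.
    for w in ("ied", "ies", "ties", "cries"):
        print(w, old_rule(w), new_rule(w))
```

The two versions disagree only when nothing at all precedes the suffix, which
is why ied and ies are the only inputs affected.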

The second is that the line,

    do ( ['y'] v <-'Y' set Y_found)

which did not match the Rule

    Set initial y ... to Y,

has been changed to

    do ( ['y'] <-'Y' set Y_found)

which does.

(The problem was whether to make the rule match the coding or the coding match
the rule. The point is that in English an initial y, when followed by a
consonant, is a vowel, but only archaic words have this shape: yclept and so
on. I have decided to keep things simple and treat initial y as a consonant in
all cases.)
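
As a rough illustration of the initial-y part of this change, the following
Python sketch contrasts the two codings. It assumes that v in the Snowball
script is the vowel grouping a e i o u y; the function names are hypothetical,
and only the initial-y case is modelled, not the remainder of the rule that is
elided above.

```python
VOWELS = set("aeiouy")  # assumed contents of the grouping v

def mark_initial_y_old(word):
    """Old coding: initial y becomes Y only when a vowel follows it."""
    if len(word) > 1 and word[0] == "y" and word[1] in VOWELS:
        return "Y" + word[1:]
    return word

def mark_initial_y_new(word):
    """New coding: initial y always becomes Y (treated as a consonant)."""
    if word.startswith("y"):
        return "Y" + word[1:]
    return word

if __name__ == "__main__":
    for w in ("yes", "yclept", "day"):
        print(w, mark_initial_y_old(w), mark_initial_y_new(w))
```

Under these assumptions the two versions differ only for words such as yclept,
where the initial y is followed by a consonant.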

Both these changes are trivial.

There is a rule to remove initial apostrophe in the stemmer, which I have come
to think is a bit feeble, but it can be left in for now.

Martin




