[Snowball-discuss] problems using the English stemmer in java

Martin Porter martin.porter at grapeshot.co.uk
Sun Aug 29 16:23:35 BST 2004


Shai,

>Is there a way that you know of to get the proper English word that results
>from the generated stem?

You need to use a complete English vocabulary (assuming the language of
application is English). For each word in the vocab, find the stem,

    horses->hors

This gives a file that can be inverted,

    hors->horses

There will be one or more unstemmed words for a given stem:

    hors->horse
    hors->horses
    hors->horsed
    hors->horsing

('horse' can be a verb: to horse around etc). Choose the shortest:

    hors->horse

This gives a mapping of stemmed form to real word, which can be used to
reconstruct a proper English word from a stemmed form.
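
The procedure above could be sketched in Python roughly as follows. This is
only an illustration under some assumptions: build_stem_to_word and toy_stem
are hypothetical names, toy_stem is a crude stand-in for a real stemmer (in
practice you would pass in the Snowball English stemmer instead), and the
alphabetical tie-break between equally short words is my own addition to make
the result deterministic.

```python
from typing import Callable, Dict, Iterable


def build_stem_to_word(vocab: Iterable[str],
                       stem: Callable[[str], str]) -> Dict[str, str]:
    """Map each stem to the shortest vocabulary word that produces it."""
    best: Dict[str, str] = {}
    for word in vocab:
        s = stem(word)
        current = best.get(s)
        # Keep the shortest word for this stem; ties are broken
        # alphabetically (an assumption, not part of the original recipe).
        if current is None or (len(word), word) < (len(current), current):
            best[s] = word
    return best


def toy_stem(word: str) -> str:
    # Hypothetical stand-in for a real stemmer: strips a few suffixes,
    # just enough to handle the demo words below.
    for suffix in ("ing", "es", "ed", "e", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


words = ["horses", "horse", "horsed", "horsing"]
mapping = build_stem_to_word(words, toy_stem)
print(mapping["hors"])  # prints "horse"
```

With a full word list, the same function would be called once to build the
table, and each stem produced at run time would then be looked up in it.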

There are several word lists of English available on the Internet. See for
example,

http://www.gtoal.com/wordgames/yawl/word.list

-- Martin




