[Snowball-discuss] Modified Stemming To Generate Valid Words

Martin Porter martin.f.porter at gmail.com
Mon Jun 12 11:19:46 BST 2017


On 6/12/17, Yash Mittal <yashmittal2009 at gmail.com> wrote:
> Over the past year or so, I have been working on
> modifying Dr. Porter's algorithm to stem words in a way that valid English
> words are generated.

Yash,

I am sorry it has taken me so long to get back to you: I am more or
less retired from computing work now (age 72) and have "frozen" the
snowball site at www.tartarus.org. I have still not fully explored
your work.

The github version of snowball is still alive, but I am not involved
in its maintenance.

It might be useful to share my own experience which has been as follows:

You often want to give the stemmed form back to a user, and it is
embarrassing if the best you can do is to give a non-dictionary text
string: "memori" instead of "memory", etc. What you can do is keep the
unstemmed form with the stemmed form, in a dictionary which you build
up as you encounter each word. Of course, a stemmed form s may derive
from several words w1, w2, w3 ... and the question then is, which of
w1, w2, w3 ... do you show the user to stand for s? The obvious guess
is to show the shortest, but that frequently gives a poor result.

Suppose you have a database of domestic objects, pepper pots, tables,
string, scissors etc. Suppose, as is plausible, "scissors" occurs 1000
times in the database, "scissor" 3 times. Then the best representative
of "scissors" is "scissors", not "scissor".
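A minimal sketch of the dictionary described above, in Python. The names
StemDisplayIndex and toy_stem are hypothetical, and toy_stem is only a
stand-in so the example runs on its own; it is not the Porter algorithm,
and any real stemmer could be passed in instead. The word counts are the
illustrative ones from the scissors example.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Stand-in stemmer for illustration only (NOT the Porter algorithm)."""
    w = word.lower()
    # Drop a plural-looking final "s" ("scissors" -> "scissor").
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    # Turn a final "y" into "i" ("memory" -> "memori").
    if w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"
    return w


class StemDisplayIndex:
    """Keeps, for each stem s, a count of the unstemmed words w1, w2, ... seen."""

    def __init__(self, stem=toy_stem):
        self.stem = stem
        self.forms = defaultdict(Counter)  # stem -> Counter of surface forms

    def add(self, word):
        """Record one occurrence of word and return its stem."""
        s = self.stem(word)
        self.forms[s][word.lower()] += 1
        return s

    def shortest(self, s):
        """The 'obvious guess': the shortest word seen for stem s."""
        return min(self.forms[s], key=len)

    def commonest(self, s):
        """The most frequently seen word for stem s."""
        return self.forms[s].most_common(1)[0][0]


index = StemDisplayIndex()
for word, count in [("scissors", 1000), ("scissor", 3), ("memory", 5)]:
    for _ in range(count):
        index.add(word)

print(index.shortest("scissor"))   # scissor
print(index.commonest("scissor"))  # scissors
print(index.commonest("memori"))   # memory
```

Here the shortest form gives "scissor", while the frequency count gives
"scissors", which is the form a user of such a database would expect.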

In an IR context, a retrieved set of documents may share a common
stemmed term s; the best w to show the user is then the commonest among
the w1, w2, w3 ... as they appear in that retrieved set. In other words,
the choice of the best w depends dynamically on the context of use.
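The context-dependent choice might be sketched like this in Python. The
function best_form, the tokenising regular expression and the sample
documents are all assumptions made for illustration, and toy_stem is
again a stand-in rather than the Porter algorithm. The point is only
that the counts are taken over the retrieved set, not the whole
collection.

```python
import re
from collections import Counter


def toy_stem(word):
    """Stand-in stemmer for illustration only (NOT the Porter algorithm)."""
    w = word.lower()
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    return w


def best_form(target_stem, retrieved_docs, stem=toy_stem):
    """Return the commonest unstemmed word for target_stem within
    retrieved_docs, or None if the stem does not occur there."""
    counts = Counter()
    for text in retrieved_docs:
        # Crude tokenisation: runs of ASCII letters, lower-cased.
        for token in re.findall(r"[a-z]+", text.lower()):
            if stem(token) == target_stem:
                counts[token] += 1
    return counts.most_common(1)[0][0] if counts else None


# Two hypothetical retrieved sets sharing the stemmed term "table".
docs_a = ["The tables were laid.", "Two tables and a chair."]
docs_b = ["A table by the window.", "The table is oak.", "Spare tables."]

print(best_form("table", docs_a))  # tables
print(best_form("table", docs_b))  # table
```

The same stem is shown as "tables" for one result set and "table" for
the other, because each set is counted separately.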

More generally, I've found that unless this trick is used, exposing an
IR user to a stem leads to confusion, even if the stem is a valid
dictionary word. They won't benefit from seeing "organ" if its context
of use is "organic farming", etc.

Martin


