[Snowball-discuss] Some advice needed - to Snowball or not to Snowball

Keith Whittingham kwhittingham at gmail.com
Tue Dec 27 17:16:28 GMT 2011


I'm looking for some advice.

I'm just starting on a project to help people learn languages. While looking at a body of text, users should be able to click on any word and have the program give its meaning. For example, clicking on the word "meaning" might display a dictionary definition of "[to] mean".

I tested the Snowball stemmer against the plurals given on the Wikipedia page http://en.wikipedia.org/wiki/English_plural and got mixed results. Here are some of the failures:

Stemmed: massages, expected: massage, got: massag
Stemmed: judges, expected: judge, got: judg
Stemmed: cherries, expected: cherry, got: cherri
Stemmed: ladies, expected: lady, got: ladi
Stemmed: germanys, expected: germany, got: germani
Stemmed: harrys, expected: harry, got: harri
Stemmed: monies, expected: money, got: moni
Stemmed: pros, expected: pro, got: pros
Stemmed: calves, expected: calf, got: calv
...
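A comparison like the one above can be produced with a small harness. The following is only a minimal sketch: find_failures and naive_stem are hypothetical names, and naive_stem is a deliberately crude stand-in (it just drops a trailing "s"), not the Snowball algorithm. To test a real stemmer, pass in whatever callable wraps it.

```python
from typing import Callable, Iterable, List, Tuple


def find_failures(
    stem: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (word, expected, got) for every case the stemmer gets wrong."""
    failures = []
    for word, expected in cases:
        got = stem(word)
        if got != expected:
            failures.append((word, expected, got))
    return failures


def naive_stem(word: str) -> str:
    """Stand-in stemmer for illustration only: drop one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


# A few (plural, expected singular) pairs taken from the list above.
CASES = [
    ("massages", "massage"),
    ("ladies", "lady"),
    ("pros", "pro"),
    ("calves", "calf"),
]

if __name__ == "__main__":
    for word, expected, got in find_failures(naive_stem, CASES):
        print(f"Stemmed: {word}, expected: {expected}, got: {got}")
```

Because the stemmer is passed in as a plain function, the same test list can be rerun unchanged against any of the approaches under consideration.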

Eventually I will need to identify the correct 'sense' in which a word is being used, and I see no way to do that correctly other than a manual process. In the end I don't think this will be a problem, but I need a stepping stone to get the system up and usable. Later on it should also act as a fallback in case there is no linked sense for a given word yet.

So the question is, what should be my approach? I see a couple of alternatives.

1/ Start with Snowball and try to improve the rules

2/ Use a hybrid system between brute force and Snowball

3/ Develop my own
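To make option 2 concrete, here is one way such a hybrid might look. This is a sketch under stated assumptions, not a definitive design: EXCEPTIONS, HEADWORDS, candidates and headword are all hypothetical names, the tiny word lists stand in for a real dictionary, and the stem argument is whatever fallback stemmer is available (Snowball or otherwise).

```python
from typing import Callable, Dict, Iterator, Set

# Hypothetical data: irregular forms listed by hand, plus the set of
# headwords that the dictionary actually has entries for.
EXCEPTIONS: Dict[str, str] = {"calves": "calf", "monies": "money"}
HEADWORDS: Set[str] = {"massage", "judge", "cherry", "lady", "pro", "mean"}


def candidates(word: str) -> Iterator[str]:
    """Yield possible base forms, from the word itself to cruder guesses."""
    yield word
    if word.endswith("ies"):
        yield word[:-3] + "y"   # ladies -> lady
    if word.endswith("es"):
        yield word[:-2]         # boxes -> box
    if word.endswith("s"):
        yield word[:-1]         # judges -> judge
    if word.endswith("ing"):
        yield word[:-3]         # meaning -> mean
        yield word[:-3] + "e"   # judging -> judge


def headword(
    word: str,
    stem: Callable[[str], str],
    exceptions: Dict[str, str] = EXCEPTIONS,
    headwords: Set[str] = HEADWORDS,
) -> str:
    """Brute-force lookup first; fall back to the stemmer only if it fails."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    for guess in candidates(key):
        if guess in headwords:
            return guess
    return stem(key)
```

The point of checking each guess against the headword list is that a form is only accepted if the dictionary can actually display it; the stemmer's output is used only when nothing better is found.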

What do people think?

Keith
