[Snowball-discuss] Evening -> Even, Marine -> Marin

Martin Porter martin.f.porter at gmail.com
Sat Jun 24 10:02:59 BST 2017


Ann B,

It is actually quite difficult to give a useful answer here, without
knowing more about the context of the work you are doing. If the
purpose is to set up an IR system for internet access and documents
are ranked by some tf.idf scheme, you may find the conflations you
point out do not matter too much. Marin/Marine would usually be
searched in a somewhat larger query context, for example:

Marine Le Pen
Louis Marin
royal marines
Marin County

where the extra words would help towards a useful ranking.
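
As a rough illustration of that point, here is a minimal tf.idf sketch in Python. The four token lists are hand-written stand-ins that mimic the Marin/Marine conflation (they are not real stemmer output), and the function name tfidf_rank and the scoring formula (raw term frequency times log(N/df)) are assumptions chosen for brevity, not any particular IR system's scheme.

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Return document indices ordered by summed tf*idf over the query terms.

    docs is a list of token lists; tokens are assumed to be stemmed already.
    """
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        # Terms absent from the collection contribute nothing.
        score = sum(tf[t] * math.log(n / df[t]) for t in query if df[t])
        scores.append((score, i))
    # Highest score first; ties keep the original document order.
    return [i for score, i in sorted(scores, key=lambda s: (-s[0], s[1]))]

# Hypothetical stemmed documents, one per query context listed above.
docs = [
    ["marin", "le", "pen", "campaign"],    # Marine Le Pen
    ["loui", "marin", "philosoph"],        # Louis Marin
    ["royal", "marin", "train"],           # royal marines
    ["marin", "counti", "california"],     # Marin County
]

print(tfidf_rank(["marin"], docs))            # conflated term alone: all scores tie
print(tfidf_rank(["marin", "counti"], docs))  # extra word decides the ranking
```

On its own the conflated term occurs in every document, so its idf is zero and it cannot separate them; the additional query word is what pulls the intended document to the top.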

Generally speaking, algorithmic stemming will lead to the problems you
note, but it is possible to build up exception lists: the Snowball
source of the English stemmer shows how to do this. If the language is
English and the software environment can accommodate programs in C,
extending the routine 'exception2' in this encoding,

http://snowball.tartarus.org/otherlangs/english_c.txt

might be the easiest option. (This module will also be on the
Snowball GitHub site.)
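
If C is not convenient, the same exception-list idea can be sketched as a wrapper around whatever stemmer is in use. The Python below is only an illustration under stated assumptions: make_stemmer, toy_stem and EXCEPTIONS are hypothetical names, toy_stem is a crude suffix stripper standing in for the real algorithm, and none of this is the code in english_c.txt.

```python
def make_stemmer(base_stem, exceptions):
    """Wrap base_stem so that words in the exceptions table bypass it."""
    def stem(word):
        w = word.lower()
        # Exception list is consulted first; only other words are stemmed.
        if w in exceptions:
            return exceptions[w]
        return base_stem(w)
    return stem

def toy_stem(word):
    """Stand-in suffix stripper (NOT the Snowball English algorithm)."""
    for suffix in ("ing", "es", "e", "s"):
        # Keep at least three letters of stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical exception table for the conflations raised in this thread.
EXCEPTIONS = {
    "evening": "evening",
    "marine": "marine",
    "marines": "marine",
}

stem = make_stemmer(toy_stem, EXCEPTIONS)

for w in ("evening", "even", "Marine", "marines", "Marin"):
    print(w, "->", stem(w))
```

Mapping each exception to an explicit target, rather than just to itself, lets inflected forms such as "marines" still conflate with "marine" while staying apart from "Marin".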

Martin


