[Snowball-discuss] Evening -> Even, Marine -> Marin
Martin Porter
martin.f.porter at gmail.com
Sat Jun 24 10:02:59 BST 2017
Ann B,
It is difficult to give a useful answer here without knowing more
about the context of the work you are doing. If the purpose is to set
up an IR system for internet access, and documents are ranked by some
tf.idf scheme, you may find that the conflations you point out do not
matter too much. Marin/Marine would usually be searched in a somewhat
larger query context,
Marine Le Pen
Louis Marin
royal marines
Marin County
where the extra words would help towards a useful ranking.
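As a rough illustration of that point, here is a small Python sketch of tf.idf scoring over three toy documents. The documents, their names and the function tfidf_scores are invented for the example, and the weighting (raw term frequency times log(N/df)) is only one of many tf.idf variants. The tokens are assumed to be already stemmed, so "marin" stands for both Marine and Marin.

```python
import math
from collections import Counter

# Hypothetical, already-stemmed toy documents (not real data).
DOCS = {
    "politics": ["marin", "le", "pen", "election", "campaign"],
    "geography": ["marin", "county", "california", "bridge"],
    "military": ["royal", "marin", "commando", "training"],
}


def tfidf_scores(query, docs):
    """Score each document as the sum of tf * log(N / df) over the query terms."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    scores = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        # Terms that occur in no document are skipped to avoid dividing by zero.
        scores[name] = sum(
            tf[t] * math.log(n / df[t]) for t in query if df[t] > 0
        )
    return scores


def best_match(query, docs):
    """Return the name of the highest-scoring document."""
    scores = tfidf_scores(query, docs)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(tfidf_scores(["marin"], DOCS))
    print(best_match(["marin", "county"], DOCS))
    print(best_match(["royal", "marin"], DOCS))
```

In this toy collection the conflated term "marin" occurs in every document, so on its own it carries zero weight and the three documents tie. It is the accompanying words that separate them.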
Generally speaking, algorithmic stemming will lead to the problems you
note, but it is possible to build up exception lists: the Snowball
source of the English stemmer shows how to do this. If the language is
English and the software environment is able to accommodate programs
in C, extending routine 'exception2' in this encoding,
http://snowball.tartarus.org/otherlangs/english_c.txt
might be the easiest option. (This module will also be on the
github-snowball site.)
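The general idea of an exception list can be sketched in a few lines of Python. This is not the Snowball 'exception2' routine and not the C encoding linked above. The table EXCEPTIONS and the functions toy_stem and stem_with_exceptions are hypothetical names, and toy_stem is a deliberately crude suffix stripper that only stands in for a real algorithmic stemmer.

```python
# Hypothetical exception table: words listed here bypass the stemmer
# and map directly to the form given.
EXCEPTIONS = {
    "evening": "evening",
    "marine": "marine",
    "marines": "marine",
}


def toy_stem(word):
    """Crude stand-in for an algorithmic stemmer (NOT the Snowball algorithm)."""
    for suffix in ("ing", "es", "e", "s"):
        # Strip the first matching suffix, keeping at least three letters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_with_exceptions(word, stemmer=toy_stem):
    """Consult the exception table first, then fall back to the stemmer."""
    key = word.lower()
    if key in EXCEPTIONS:
        return EXCEPTIONS[key]
    return stemmer(key)


if __name__ == "__main__":
    for w in ("Evening", "Marine", "marines", "Marin"):
        print(w, "->", toy_stem(w.lower()), "|", stem_with_exceptions(w))
```

With the table in place, "evening" and "marine" are kept apart from "even" and "marin", while words not in the table are stemmed as before. The same wrapper can be put around any real stemmer by passing it as the stemmer argument.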
Martin