[Snowball-discuss] Minor mistakes in the english vocabulary
Olly Betts
olly at survex.com
Sat Jan 21 06:10:10 GMT 2012
On Fri, Jan 20, 2012 at 06:15:58PM +0100, Jannik Zschiesche wrote:
> While testing the implementation against the English vocabulary I
> found what I think are some mistakes. Please correct me if I am
> wrong.
> (I used Porter2: http://snowball.tartarus.org/algorithms/english/stemmer.html)
>
> In the vocabulary, there are the following transformations (and some
> more, but I don't want to flood you):
These cases are all in the exceptions list - see
http://snowball.tartarus.org/algorithms/english/stemmer.html and search
for the code which starts:
define exception1 as (
and:
define exception2 as (
The reasons for these exceptions are also covered there:
The exception lists in the English stemmer are meant to be
illustrative ('this is how it is done if you want to do it'), and were
derived piecemeal.
a) The new stemmer improves on the Porter stemmer in handling short
words ending e and y. There is however a mishandling of the four forms
sky, skies, ski, skis, which is easily corrected by treating three of
these words as special cases.
b) Similarly there is a problem with the ing form of three letter
verbs ending ie. There are only three such verbs: die, lie and tie, so
a special case is made for dying, lying and tying.
[...]
e) The remaining words were included following complaints from users
of the Porter algorithm. news is not the plural of new (noticed when
IR systems were being set up for Reuters). Howe is a surname, and
needs to be separated from how (noticed when doing a search for 'Sir
Geoffrey Howe' in a demonstration at the House of Commons). succeed
etc are not past participles, so the ed should not be removed (pointed
out to me in an email from India). herring should not stem to her
(another email from Russia).
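
To make the mechanism concrete, here is a rough Python sketch of how an exception list can sit in front of the general rules. It is not the Snowball code. The names SPECIAL_CASES, INVARIANTS, naive_stem and stem are made up for illustration, naive_stem is a deliberately crude stand-in rather than Porter2, and only words mentioned in the text above are listed (the real lists are longer). The Snowball source keeps two separate lists, exception1 and exception2; the sketch merges them into a single lookup for brevity.

```python
# Hypothetical illustration of an exception list checked before the
# general stemming rules. Only words named in this thread are included.

# Words whose stem is given explicitly.
SPECIAL_CASES = {
    "skis": "ski",
    "skies": "sky",
    "dying": "die",
    "lying": "lie",
    "tying": "tie",
}

# Words that must be left unchanged.
INVARIANTS = {"sky", "news", "howe", "herring", "succeed"}


def naive_stem(word):
    """Crude stand-in for the general rules (NOT Porter2)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Strip the first matching suffix if at least two letters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Consult the exception lists first, then fall back to the rules."""
    word = word.lower()
    if word in SPECIAL_CASES:
        return SPECIAL_CASES[word]
    if word in INVARIANTS:
        return word
    return naive_stem(word)
```

With this in place, stem("news") stays "news" and stem("herring") stays "herring", whereas calling naive_stem on them directly would strip the s and the ing. That is the kind of difference a comparison against the vocabulary file will surface if the exception lists are left out of an implementation.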
Cheers,
Olly