[Snowball-discuss] Minor mistakes in the english vocabulary

Olly Betts olly at survex.com
Sat Jan 21 06:10:10 GMT 2012


On Fri, Jan 20, 2012 at 06:15:58PM +0100, Jannik Zschiesche wrote:
> While testing the implementation against the english vocabulary I
> found some - what I think - mistakes. Please correct me, if I am
> wrong.
> (I used Porter2: http://snowball.tartarus.org/algorithms/english/stemmer.html)
> 
> In the vocabulary, there are the following transformations (and some
> more, but I don't want to flood you):

These cases are all in the exception lists - see
http://snowball.tartarus.org/algorithms/english/stemmer.html and search
for the code which starts:

define exception1 as (

and:

define exception2 as (

The reasons for these exceptions are also covered there:

  The exception lists in the English stemmer are meant to be
  illustrative ('this is how it is done if you want to do it'), and were
  derived piecemeal. 

  a) The new stemmer improves on the Porter stemmer in handling short
  words ending e and y. There is however a mishandling of the four forms
  sky, skies, ski, skis, which is easily corrected by treating three of
  these words as special cases. 

  b) Similarly there is a problem with the ing form of three letter
  verbs ending ie. There are only three such verbs: die, lie and tie, so
  a special case is made for dying, lying and tying.
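To make (a) and (b) concrete, here is a minimal Python sketch of a whole-word special-case lookup that runs before the normal stemming rules. It is not the Snowball source: the names SPECIAL_CASES and stem_with_exceptions are made up for illustration, the table only holds the words mentioned above, and the choice of sky, skies and skis as the three special-cased forms is my reading of (a).

```python
# Hypothetical lookup table; only the words discussed in (a) and (b).
SPECIAL_CASES = {
    "skis": "ski",    # plural of ski
    "skies": "sky",   # plural of sky
    "sky": "sky",     # left unchanged
    "dying": "die",
    "lying": "lie",
    "tying": "tie",
}


def stem_with_exceptions(word, fallback):
    """Return the special-case stem if there is one, else call fallback.

    `fallback` stands in for the regular stemming rules and is supplied
    by the caller; it is an assumption of this sketch.
    """
    word = word.lower()
    if word in SPECIAL_CASES:
        return SPECIAL_CASES[word]
    return fallback(word)
```

A call such as stem_with_exceptions("dying", some_stemmer) returns "die" without ever reaching the general rules, which is the point of handling these words up front.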

[...]

  e) The remaining words were included following complaints from users
  of the Porter algorithm. news is not the plural of new (noticed when
  IR systems were being set up for Reuters). Howe is a surname, and
  needs to be separated from how (noticed when doing a search for 'Sir
  Geoffrey Howe' in a demonstration at the House of Commons). succeed
  etc are not past participles, so the ed should not be removed (pointed
  out to me in an email from India). herring should not stem to her
  (another email from Russia).
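The effect of (e) can be shown with a small Python sketch. Note the assumptions: toy_strip is a deliberately naive suffix stripper written for this example, not the Porter or Porter2 algorithm, and PROTECTED merges the words from (e) into a single set, whereas the Snowball code splits its exceptions across exception1 and exception2.

```python
# Hypothetical set of protected words, taken from the examples in (e).
PROTECTED = frozenset({"news", "howe", "succeed", "herring"})


def toy_strip(word):
    """Naive suffix stripping for illustration only (NOT Porter2)."""
    for suffix in ("ing", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a trailing doubled consonant, e.g. herr -> her.
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def stem_with_guard(word):
    """Leave protected words untouched; otherwise use the toy stripper."""
    word = word.lower()
    if word in PROTECTED:
        return word
    return toy_strip(word)
```

Without the guard, the toy stripper conflates herring with her, news with new and howe with how, and cuts succeed down to succe; with the guard, all four words come back unchanged.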

Cheers,
    Olly


