[Snowball-discuss] Minor mistakes in the english vocabulary

Jannik Zschiesche hello at apfelbox.net
Tue Jan 24 11:49:52 GMT 2012


Hi everyone,

Thanks for your answers.
Yes, I missed the most obvious spot: the list of exceptions.

I will implement them and check the vocabulary again.
I have also found some differences in the German vocabulary, but they might be (missed) exceptions as well.


@ Robert Hafner:
Thank you for the offer, but I am working on a more general approach. I want to provide a foundation for implementations of all (listed) languages (mainly: en, es, de), so I would like to have a single library for all of them.


Kind Regards
Jannik Zschiesche

On 21.01.2012 at 07:10, Olly Betts wrote:

> On Fri, Jan 20, 2012 at 06:15:58PM +0100, Jannik Zschiesche wrote:
>> While testing the implementation against the English vocabulary I
>> found some (what I think are) mistakes. Please correct me if I am
>> wrong.
>> (I used Porter2: http://snowball.tartarus.org/algorithms/english/stemmer.html)
>> 
>> In the vocabulary, there are the following transformations (and some
>> more, but I don't want to flood you):
> 
> These cases are all in the exceptions list - see
> http://snowball.tartarus.org/algorithms/english/stemmer.html and search
> for the code which starts:
> 
> define exception1 as (
> 
> and:
> 
> define exception2 as (
> 
> The reasons for these exceptions are also covered there:
> 
>  The exception lists in the English stemmer are meant to be
>  illustrative ('this is how it is done if you want to do it'), and were
>  derived piecemeal. 
> 
>  a) The new stemmer improves on the Porter stemmer in handling short
>  words ending e and y. There is however a mishandling of the four forms
>  sky, skies, ski, skis, which is easily corrected by treating three of
>  these words as special cases. 
> 
>  b) Similarly there is a problem with the ing form of three letter
>  verbs ending ie. There are only three such verbs: die, lie and tie, so
>  a special case is made for dying, lying and tying.
> 
> [...]
> 
>  e) The remaining words were included following complaints from users
>  of the Porter algorithm. news is not the plural of new (noticed when
>  IR systems were being set up for Reuters). Howe is a surname, and
>  needs to be separated from how (noticed when doing a search for 'Sir
>  Geoffrey Howe' in a demonstration at the House of Commons). succeed
>  etc are not past participles, so the ed should not be removed (pointed
>  out to me in an email from India). herring should not stem to her
>  (another email from Russia).
> 
> Cheers,
>    Olly
> 
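
For illustration, here is a minimal sketch of how the two exception lists described above could be consulted around the regular stemming rules. It is not the Snowball source. The names EXCEPTION1, EXCEPTION2, stem_with_exceptions, step_1a and remaining_steps are hypothetical, the lists contain only the words named in the quoted text, and the individual mappings (for example skis to ski) are assumed from the published Porter2 description and should be checked against it. The rule steps themselves are passed in as placeholders.

```python
from typing import Callable

# Whole-word special cases, checked before any rule is applied.
# Only the words named in the quoted text are listed here; the real
# list is longer. Mappings are an assumption to verify against the
# published algorithm description.
EXCEPTION1 = {
    "skis": "ski",
    "skies": "sky",
    "sky": "sky",    # left unchanged
    "dying": "die",
    "lying": "lie",
    "tying": "tie",
    "news": "news",  # not the plural of "new"
    "howe": "howe",  # surname, kept apart from "how"
}

# Words that must not be stemmed further once the first suffix step
# has run (assumed position of the second check). Again a partial list.
EXCEPTION2 = {"succeed", "herring"}


def stem_with_exceptions(
    word: str,
    step_1a: Callable[[str], str],
    remaining_steps: Callable[[str], str],
) -> str:
    """Apply the exception lists around caller-supplied rule steps."""
    word = word.lower()
    if word in EXCEPTION1:
        return EXCEPTION1[word]
    word = step_1a(word)
    if word in EXCEPTION2:
        return word
    return remaining_steps(word)


# Placeholder rule steps, only to make the sketch runnable.
def identity(word: str) -> str:
    return word


def strip_ing(word: str) -> str:
    return word[:-3] if word.endswith("ing") else word
```

With these placeholders, stem_with_exceptions("herring", identity, strip_ing) leaves the word untouched because the second list stops further processing, while an ordinary word such as "walking" still goes through the rules.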
