[Snowball-discuss] Minor mistakes in the English vocabulary
Jannik Zschiesche
hello at apfelbox.net
Tue Jan 24 11:49:52 GMT 2012
Hi everyone,
Thanks for your answers.
Yes, I missed the most obvious spot: the list of exceptions.
I will implement them and check the vocabulary again.
I have also found some differences in the German vocabulary, but they may be exceptions I missed as well.
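A minimal sketch of the exception lookup, assuming the table is consulted before any suffix rules run. Only the special cases named in the quoted explanation below are included; the full lists are on the Snowball page. The names `EXCEPTIONS`, `naive_rules` and `stem` are hypothetical, and `naive_rules` is only a stand-in for the real Porter2 steps, not an implementation of them.

```python
# Subset of the special cases mentioned in the quoted explanation.
# This is an illustration, not the complete exception list.
EXCEPTIONS = {
    "skis": "ski",
    "skies": "sky",
    "sky": "sky",
    "dying": "die",
    "lying": "lie",
    "tying": "tie",
    "news": "news",  # not the plural of "new"
    "howe": "howe",  # surname, kept apart from "how"
}


def naive_rules(word):
    """Hypothetical stand-in for the real suffix-stripping steps."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def stem(word, rules=naive_rules):
    """Return the exception entry if there is one, else apply the rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return rules(word)


if __name__ == "__main__":
    for w in ("skies", "dying", "news", "cats"):
        print(w, "->", stem(w))
```

Without the table, the stand-in rules would turn "news" into "new", which is the kind of difference that shows up when comparing against the vocabulary.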
@ Robert Hafner:
Thank you for the offer, but I am working on a more general approach. I want to provide a foundation for implementations of all the listed languages (mainly en, es, de), so I would like to have a single library for all of them.
Kind Regards
Jannik Zschiesche
On 21.01.2012 at 07:10, Olly Betts wrote:
> On Fri, Jan 20, 2012 at 06:15:58PM +0100, Jannik Zschiesche wrote:
>> While testing the implementation against the English vocabulary I
>> found what I think are some mistakes. Please correct me if I am
>> wrong.
>> (I used Porter2: http://snowball.tartarus.org/algorithms/english/stemmer.html)
>>
>> In the vocabulary, there are the following transformations (and some
>> more, but I don't want to flood you):
>
> These cases are all in the exceptions list - see
> http://snowball.tartarus.org/algorithms/english/stemmer.html and search
> for the code which starts:
>
> define exception1 as (
>
> and:
>
> define exception2 as (
>
> The reasons for these exceptions are also covered there:
>
> The exception lists in the English stemmer are meant to be
> illustrative ('this is how it is done if you want to do it'), and were
> derived piecemeal.
>
> a) The new stemmer improves on the Porter stemmer in handling short
> words ending e and y. There is however a mishandling of the four forms
> sky, skies, ski, skis, which is easily corrected by treating three of
> these words as special cases.
>
> b) Similarly there is a problem with the ing form of three letter
> verbs ending ie. There are only three such verbs: die, lie and tie, so
> a special case is made for dying, lying and tying.
>
> [...]
>
> e) The remaining words were included following complaints from users
> of the Porter algorithm. news is not the plural of new (noticed when
> IR systems were being set up for Reuters). Howe is a surname, and
> needs to be separated from how (noticed when doing a search for 'Sir
> Geoffrey Howe' in a demonstration at the House of Commons). succeed
> etc are not past participles, so the ed should not be removed (pointed
> out to me in an email from India). herring should not stem to her
> (another email from Russia).
>
> Cheers,
> Olly
>