[Snowball-discuss] Minor mistakes in the english vocabulary
Robert Hafner
tedivm at tedivm.com
Tue Jan 24 22:38:08 GMT 2012
@Martin- sure, feel free to!
@Jannik- I understand, and that makes sense, but I think you'd have better luck finishing a PHP backend for the Snowball compiler than porting each algorithm by hand. I started on this before, and honestly would like to pick it up again once I have a bit more time. The Java backend seems like the best place to start, since PHP's OOP syntax is heavily based on Java's anyway (most of the work is cutting out the type declarations and things like that).
Robert
On Jan 24, 2012, at 3:49 AM, Jannik Zschiesche wrote:
> Hi everyone,
>
> thanks for your answers.
> Yes, I missed the most obvious spot: the list of exceptions.
>
> I will implement them and check the vocabulary again.
> I have found some differences in the German vocabulary too, but they might be (missed) exceptions as well.
>
>
> @ Robert Hafner:
> thank you for the offer, but I am working on a more general approach. I want to provide a foundation for implementations of all (listed) languages (mainly: en, es, de), and therefore I'd like to have a single library for all of them.
>
>
> Kind Regards
> Jannik Zschiesche
>
> Am 21.01.2012 um 07:10 schrieb Olly Betts:
>
>> On Fri, Jan 20, 2012 at 06:15:58PM +0100, Jannik Zschiesche wrote:
>>> While testing the implementation against the English vocabulary I
>>> found what I think are some mistakes. Please correct me if I am
>>> wrong.
>>> (I used Porter2: http://snowball.tartarus.org/algorithms/english/stemmer.html)
>>>
>>> In the vocabulary, there are the following transformations (and some
>>> more, but I don't want to flood you):
>>
>> These cases are all in the exceptions list - see
>> http://snowball.tartarus.org/algorithms/english/stemmer.html and search
>> for the code which starts:
>>
>> define exception1 as (
>>
>> and:
>>
>> define exception2 as (
>>
>> The reasons for these exceptions are also covered there:
>>
>> The exception lists in the English stemmer are meant to be
>> illustrative ('this is how it is done if you want to do it'), and were
>> derived piecemeal.
>>
>> a) The new stemmer improves on the Porter stemmer in handling short
>> words ending e and y. There is however a mishandling of the four forms
>> sky, skies, ski, skis, which is easily corrected by treating three of
>> these words as special cases.
>>
>> b) Similarly there is a problem with the ing form of three letter
>> verbs ending ie. There are only three such verbs: die, lie and tie, so
>> a special case is made for dying, lying and tying.
>>
>> [...]
>>
>> e) The remaining words were included following complaints from users
>> of the Porter algorithm. news is not the plural of new (noticed when
>> IR systems were being set up for Reuters). Howe is a surname, and
>> needs to be separated from how (noticed when doing a search for 'Sir
>> Geoffrey Howe' in a demonstration at the House of Commons). succeed
>> etc are not past participles, so the ed should not be removed (pointed
>> out to me in an email from India). herring should not stem to her
>> (another email from Russia).
>>
>> Cheers,
>> Olly
>>
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss at lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
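
For anyone porting by hand, the exception handling Olly quotes above can be sketched as a table lookup done before the normal suffix rules run. This is only a rough Python sketch, not the Snowball source: the tables hold just the words named in this thread (the real lists are longer), `stem_with_exceptions` and `naive_strip` are made-up names, and `naive_strip` is a deliberately crude stand-in for the real algorithm. The real stemmer also keeps two lists (exception1 and exception2) and checks them at different points; the sketch folds them into one lookup for brevity.

```python
from typing import Callable

# Illustrative subset: only words mentioned in this thread.
# The real exception lists in the English stemmer contain more entries.
SPECIAL_CHANGES = {
    "skis": "ski",
    "skies": "sky",
    "dying": "die",
    "lying": "lie",
    "tying": "tie",
}

# Words that must come out unchanged (assumption: merged here into one set,
# whereas the real stemmer splits them across two exception lists).
INVARIANT = {"sky", "news", "howe", "succeed", "herring"}


def stem_with_exceptions(word: str, fallback: Callable[[str], str]) -> str:
    """Consult the exception tables first, then fall back to the rules."""
    word = word.lower()
    if word in SPECIAL_CHANGES:
        return SPECIAL_CHANGES[word]
    if word in INVARIANT:
        return word
    return fallback(word)


def naive_strip(word: str) -> str:
    """Hypothetical placeholder for the real suffix rules (NOT Porter2)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("news", "skies", "dying", "herring", "walking"):
        print(w, "->", stem_with_exceptions(w, naive_strip))
```

Keeping the exceptions in plain tables, separate from the rule code, makes it easy to diff them against the published lists when a vocabulary check turns up a mismatch.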