[Snowball-discuss] Fwd: Error in the vocabulary for Italian stemmer?
Peter Stahl
pemistahl at googlemail.com
Mon Jun 14 17:00:56 BST 2010
Begin forwarded message:
> From: Peter Stahl <pemistahl at googlemail.com>
> Date: 14 June 2010 17:56:13 CEST
> To: Martin Porter <martin at porterloo.wanadoo.co.uk>
> Subject: Re: [Snowball-discuss] Error in the vocabulary for Italian stemmer?
>
> Hi Martin,
>
> thanks for your reply. I think I understand what you mean. The funny thing is that I implemented my wrong interpretation of your algorithm for every language, and the only mismatches against your testing vocabulary were in Italian and Portuguese. I wrote my own implementations from the descriptions provided on your site and did not look at your code. For people who want to work the same way, it would be good if the descriptions made it a bit clearer that one should not search for the longest suffix that can be deleted, as this might be a source of misunderstanding.
>
> Best regards,
> Peter
>
>
>
> On 14.06.2010 at 13:02, Martin Porter wrote:
>
>>
>> Peter,
>>
>> The point is that the search is for the longest suffix, and when no action
>> is taken for that suffix, the algorithm doesn't then move to a shorter
>> suffix for which action can be taken.
>>
>> So giocatrici (=comedienne, I assume) ends -ici as well as -atrici. The
>> longer, -atrici, is taken, and there is no action (because it is not in R2).
>> The -ici ending is not considered, even though it is in R2.
>>
>> In other words, you search for the longest suffix and see if it is removable;
>> you don't search for the longest suffix which is removable.
>>
>> I guess the same point lies behind your second email,
>>
>> Martin
>>
>>
>>
>>
>>
>> At 04:05 PM 6/13/2010 +0200, Peter Stahl wrote:
>>>
>>> Hi everyone,
>>>
>>> my name is Peter and I'm a student of computational linguistics from
>>> Bochum, Germany. While porting your stemmers to pure Python, I
>>> always compare the stemmed results of my code to those in the testing
>>> vocabulary provided on your site. Most recently, I compared the results
>>> of the Italian stemmer. My implementation produces the same results as
>>> yours, with the exception of nine words. They all end in either 'ici' or
>>> 'ice'. In my opinion, the stemmed forms of those words in the
>>> corresponding diffs.txt file are wrong according to the described steps
>>> of the algorithm. The words in question are:
>>>
>>> giocatrici giocatr (giocatric)
>>> mediatrice mediatr (mediatric)
>>> pagatrice pagatr (pagatric)
>>> portatrice portatr (portatric)
>>> portatrici portatr (portatric)
>>> ricreatrice ricreatr (ricreatric)
>>> roccatrici roccatr (roccatric)
>>> salvatrice salvatr (salvatric)
>>> sfruttatrici sfruttatr (sfruttatric)
>>>
>>> The first column shows the unstemmed forms, the second shows the stemmed
>>> forms of my implementation, the third shows the stemmed forms of your
>>> testing vocabulary.
>>> Let us take the word 'giocatrici' as an example:
>>>
>>> R1 = 'atrici'
>>> R2 = 'rici'
>>> RV = 'catrici'
>>>
>>> According to step 1, the suffixes 'ici' and 'ice' are supposed to be
>>> deleted if they are in R2. This is definitely the case here. So the word
>>> 'giocatrici' is stemmed to 'giocatr' and not to 'giocatric'.
>>> Is this an error in your testing vocabulary or in the description of the
>>> Italian stemmer? Or did I miss anything?
>>>
>>>
>>> Thanks and best regards,
>>> Peter
>>>
>>>
>>>
>
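For anyone else porting from the descriptions: the two readings discussed in this thread can be put side by side in a few lines of Python. This is only a rough sketch, not the Snowball code. The function names are made up for illustration, the suffix tuple is a small assumed subset of the Italian step 1 "delete if in R2" list (just the four endings mentioned above), and all other steps of the stemmer are left out, so the return values are not final stems.

```python
VOWELS = set("aeiouàèìòù")

# Assumed subset of the step 1 suffixes, enough to show the difference.
SUFFIXES = ("atrice", "atrici", "ice", "ici")


def region_start(word, start=0):
    """Index just after the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Start indices of R1 and R2 (R2 is the same rule applied to R1)."""
    r1 = region_start(word)
    r2 = region_start(word, r1)
    return r1, r2


def step1_as_described(word):
    """Find the longest matching suffix first, then test it against R2.
    If that suffix is not in R2, nothing is removed in this step."""
    _, r2 = r1_r2(word)
    matches = [s for s in SUFFIXES if word.endswith(s)]
    if not matches:
        return word
    suffix = max(matches, key=len)
    if len(word) - len(suffix) >= r2:
        return word[: -len(suffix)]
    return word  # no fallback to a shorter suffix


def step1_misread(word):
    """The misreading: pick the longest suffix that is removable."""
    _, r2 = r1_r2(word)
    removable = [
        s for s in SUFFIXES
        if word.endswith(s) and len(word) - len(s) >= r2
    ]
    if not removable:
        return word
    return word[: -len(max(removable, key=len))]


if __name__ == "__main__":
    for w in ("giocatrici", "mediatrice"):
        print(w, r1_r2(w), step1_as_described(w), step1_misread(w))
```

For 'giocatrici', R1 starts at index 4 ('atrici') and R2 at index 6 ('rici'). step1_as_described picks 'atrici', finds it is not in R2 and leaves the word unchanged; step1_misread falls back to 'ici' and returns 'giocatr', which is the second column in the list above.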