[Snowball-discuss] Fwd: Error in the vocabulary for Italian stemmer?

Peter Stahl pemistahl at googlemail.com
Mon Jun 14 17:00:56 BST 2010



Begin forwarded message:

> From: Peter Stahl <pemistahl at googlemail.com>
> Date: 14 June 2010 17:56:13 CEST
> To: Martin Porter <martin at porterloo.wanadoo.co.uk>
> Subject: Re: [Snowball-discuss] Error in the vocabulary for Italian stemmer?
> 
> Hi Martin,
> 
> thanks for your reply. I think I understand what you mean. The funny thing is that I implemented my wrong interpretation of your algorithm for every language and got no mismatches against your testing vocabulary, except for Italian and Portuguese. I wrote my own implementations from the descriptions provided on your site and did not look at your code. For people who want to work the same way, it would help if the descriptions made it a bit clearer that one should not search for the longest suffix that can be deleted, as this is an easy source of misunderstanding.
> 
> Best regards,
> Peter  
> 
> 
> 
> On 14 June 2010 at 13:02, Martin Porter wrote:
> 
>> 
>> Peter, 
>> 
>> The point is that the search is for the longest suffix, and when no action
>> is taken for that suffix, the algorithm doesn't then move to a shorter
>> suffix for which action can be taken.
>> 
>> So giocatrici (=comedienne, I assume) ends -ici as well as -atrici. The
>> longer, -atrici, is taken, and there is no action (because it is not in R2).
>> The -ici ending is not considered, even though it is in R2.
>> 
>> In other words, you search for the longest suffix and see if it is removable;
>> you don't search for the longest suffix which is removable.
>> 
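The distinction Martin draws can be sketched in Python. This is only an illustration of step 1 for a four-suffix subset of the Italian list, not the full stemmer; the function names and the `r2_start` parameter are hypothetical, and the R2 offset for 'giocatrici' is taken from the regions quoted later in this thread (R2 = 'rici').

```python
# Illustrative subset of the Italian step-1 suffixes that are
# "deleted if in R2"; the real list is much longer.
SUFFIXES = ("atrice", "atrici", "ice", "ici")


def longest_suffix(word, suffixes=SUFFIXES):
    """Return the longest listed suffix that ends `word`, or None."""
    matches = [s for s in suffixes if word.endswith(s)]
    return max(matches, key=len) if matches else None


def strip_longest_then_test(word, r2_start):
    """Rule as Martin describes it: find the longest suffix first,
    then test whether it lies in R2. No fallback to a shorter suffix."""
    suffix = longest_suffix(word)
    if suffix is not None and len(word) - len(suffix) >= r2_start:
        return word[: -len(suffix)]
    return word


def strip_longest_removable(word, r2_start):
    """The misreading: take the longest suffix among those lying in R2."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= r2_start:
            return word[: -len(suffix)]
    return word


word = "giocatrici"
r2_start = len(word) - len("rici")  # R2 = 'rici' starts at index 6

print(strip_longest_then_test(word, r2_start))   # giocatrici (step 1 does nothing)
print(strip_longest_removable(word, r2_start))   # giocatr
```

With the first function, step 1 leaves the word untouched, so a later step is free to strip the final vowel, which is consistent with the 'giocatric' entry in the testing vocabulary.
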
>> I guess the same point lies behind your second email,
>> 
>> Martin
>> 
>> 
>> 
>> 
>> 
>> At 04:05 PM 6/13/2010 +0200, Peter Stahl wrote:
>>> 
>>> Hi everyone,
>>> 
>>> my name is Peter and I'm a student of computational linguistics from
>>> Bochum, Germany. While porting your stemmers to pure Python, I always
>>> compare the stemmed results of my code to those in the testing vocabulary
>>> provided on your site. Lastly, I compared the results of the Italian
>>> stemmer. My implementation produces the same results as yours, with the
>>> exception of nine words. They all end with either 'ici' or 'ice'. In my
>>> opinion, the stemmed forms of those words in the corresponding diffs.txt
>>> file are wrong according to the steps of the algorithm as described. These
>>> are the words in question:
>>> 
>>> giocatrici     giocatr      (giocatric)
>>> mediatrice     mediatr      (mediatric)
>>> pagatrice      pagatr       (pagatric)
>>> portatrice     portatr      (portatric)
>>> portatrici     portatr      (portatric)
>>> ricreatrice    ricreatr     (ricreatric)
>>> roccatrici     roccatr      (roccatric)
>>> salvatrice     salvatr      (salvatric)
>>> sfruttatrici   sfruttatr    (sfruttatric)
>>> 
>>> The first column shows the unstemmed forms, the second shows the stemmed
>>> forms of my implementation, the third shows the stemmed forms of your
>>> testing vocabulary.
>>> Let us take the word 'giocatrici' as an example:
>>> 
>>> R1 = 'atrici'
>>> R2 = 'rici'
>>> RV = 'catrici'
>>> 
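For readers who want to check these regions, here is a minimal Python sketch. It assumes the region definitions as I understand them from the Snowball descriptions (R1 and R2 as the region after the first non-vowel following a vowel, RV with the Spanish/Italian three-case rule) and skips the Italian stemmer's preprocessing of accents and of 'u' after 'q'. The helper names are hypothetical.

```python
VOWELS = "aeiouàèìòù"


def r1_r2(word):
    """Return (R1, R2). Each region starts after the first non-vowel
    that follows a vowel, or is empty if there is no such position."""

    def region_start(start):
        for i in range(start + 1, len(word)):
            if word[i] not in VOWELS and word[i - 1] in VOWELS:
                return i + 1
        return len(word)

    r1 = region_start(0)
    r2 = region_start(r1)  # same rule, applied within R1
    return word[r1:], word[r2:]


def rv(word):
    """Return RV, or '' if the required position cannot be found."""
    if len(word) < 3:
        return ""
    if word[1] not in VOWELS:
        # Second letter is a consonant: region after the next vowel.
        for i in range(2, len(word)):
            if word[i] in VOWELS:
                return word[i + 1:]
        return ""
    if word[0] in VOWELS:
        # First two letters are vowels: region after the next consonant.
        for i in range(2, len(word)):
            if word[i] not in VOWELS:
                return word[i + 1:]
        return ""
    # Consonant-vowel case: region after the third letter.
    return word[3:]


print(r1_r2("giocatrici"))  # ('atrici', 'rici')
print(rv("giocatrici"))     # catrici
```

The output agrees with the three regions listed above, so the regions themselves are not where the two sets of results diverge.
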
>>> According to step 1, the suffixes 'ici' or 'ice', respectively, are
>>> supposed to be deleted if one of these is in R2. This is definitely the case
>>> here. So the word 'giocatrici' is stemmed to 'giocatr' and not to 'giocatric'.
>>> Is this an error in your testing vocabulary or in the description of the
>>> Italian stemmer? Or did I miss anything?
>>> 
>>> 
>>> Thanks and best regards,
>>> Peter
>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
