[Snowball-discuss] Error in the vocabulary for Italian stemmer?

Martin Porter martin at porterloo.wanadoo.co.uk
Mon Jun 14 12:02:53 BST 2010


Peter, 

The point is that the search is for the longest suffix, and when no action
is taken for that suffix, the algorithm doesn't then move to a shorter
suffix for which action can be taken.

So giocatrici (=comedienne, I assume) ends in -ici as well as -atrici. The
longer suffix, -atrici, is taken, and there is no action (because it is not
in R2). The -ici ending is not considered, even though it is in R2.

In other words, you search for the longest suffix and see if it is removable;
you don't search for the longest suffix which is removable.
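
To make the difference concrete, here is a rough Python sketch of the two
readings. It is only an illustration, not the Snowball source: the function
names are made up for this example, the suffix list is a small subset of the
step 1 endings, and R1/R2 are computed with the usual "after the first
non-vowel following a vowel" rule.

```python
VOWELS = "aeiouàèìòù"

# Assumption: only a small subset of the step 1 suffixes, enough for this case.
SUFFIXES = ("atrici", "atrice", "ici", "ice")


def region_after_vc(word, start=0):
    """Index just past the first non-vowel that follows a vowel, from `start`."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)  # the region is empty


def r1_r2(word):
    """Return the start offsets of R1 and R2."""
    r1 = region_after_vc(word)
    r2 = region_after_vc(word, r1)
    return r1, r2


def strip_longest_then_check(word):
    """Find the longest matching suffix, then test whether it lies in R2."""
    _, r2 = r1_r2(word)
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            # The search stops here whether or not the suffix is removable.
            if len(word) - len(suffix) >= r2:
                return word[: -len(suffix)]
            return word
    return word


def strip_longest_removable(word):
    """The other reading: fall back to shorter suffixes until one is in R2."""
    _, r2 = r1_r2(word)
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= r2:
            return word[: -len(suffix)]
    return word


print(r1_r2("giocatrici"))                     # R1 = 'atrici', R2 = 'rici'
print(strip_longest_then_check("giocatrici"))  # no action in this step
print(strip_longest_removable("giocatrici"))   # 'ici' removed
```

With the first function giocatrici is left unchanged by this step, and a later
step of the stemmer drops the final vowel to give giocatric, as in the
vocabulary. The second function gives giocatr.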

I guess the same point lies behind your second email.

Martin





At 04:05 PM 6/13/2010 +0200, Peter Stahl wrote:
>
>Hi everyone,
>
>My name is Peter and I'm a student of computational linguistics from
Bochum, Germany. While porting your stemmers to pure Python, I always
compare the stemmed results of my code with those in the testing
vocabulary provided on your site. Most recently, I compared the results of
the Italian stemmer. My implementation produces the same results as yours,
with the exception of nine words. They all end in either 'ici' or 'ice'.
In my opinion, the stemmed forms of those words in the corresponding diffs.txt
file are wrong according to the described steps of the algorithm. These are
the words in question:
>
>giocatrici      giocatr      (giocatric)
>mediatrice      mediatr      (mediatric)
>pagatrice       pagatr       (pagatric)
>portatrice      portatr      (portatric)
>portatrici      portatr      (portatric)
>ricreatrice     ricreatr     (ricreatric)
>roccatrici      roccatr      (roccatric)
>salvatrice      salvatr      (salvatric)
>sfruttatrici    sfruttatr    (sfruttatric)
>
>The first column shows the unstemmed forms, the second shows the stemmed
forms of my implementation, the third shows the stemmed forms of your
testing vocabulary.
>Let us take the word 'giocatrici' as an example:
>
>R1 = 'atrici'
>R2 = 'rici'
>RV = 'catrici'
>
>According to step 1, the suffixes 'ici' or 'ice', respectively, are
supposed to be deleted if one of these is in R2. This is definitely the case
here. So the word 'giocatrici' is stemmed to 'giocatr' and not to 'giocatric'. 
>Is this an error in your testing vocabulary or in the description of the
Italian stemmer? Or did I miss anything?
>
>
>Thanks and best regards,
>Peter
>
>
>





