[Snowball-discuss] Error in the vocabulary for Italian stemmer?

Peter Stahl pemistahl at googlemail.com
Sun Jun 13 15:05:41 BST 2010


Hi everyone,

My name is Peter and I'm a student of computational linguistics from Bochum, Germany. While porting your stemmers to pure Python, I compare the stemmed results of my code against the testing vocabulary provided on your site. Most recently, I checked the Italian stemmer. My implementation produces the same results as yours, except for nine words, all of which end in either 'ici' or 'ice'. In my opinion, the stemmed forms of those words in the corresponding diffs.txt file are wrong according to the steps of the algorithm as described. The words in question are:

giocatrici      giocatr      (giocatric)
mediatrice      mediatr      (mediatric)
pagatrice       pagatr       (pagatric)
portatrice      portatr      (portatric)
portatrici      portatr      (portatric)
ricreatrice     ricreatr     (ricreatric)
roccatrici      roccatr      (roccatric)
salvatrice      salvatr      (salvatric)
sfruttatrici    sfruttatr    (sfruttatric)

The first column shows the unstemmed forms, the second the stems produced by my implementation, and the third (in parentheses) the stems in your testing vocabulary.
Let us take the word 'giocatrici' as an example:

R1 = 'atrici'
R2 = 'rici'
RV = 'catrici'
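
For reference, the regions above can be reproduced with a small sketch like the following. It assumes the usual Snowball definitions of R1, R2 and RV for Italian and leaves out the preprocessing the full stemmer performs (accent normalisation and the special treatment of 'u' and 'i' between vowels). The function names are my own and are not taken from the Snowball sources.

```python
# Italian vowels, including the grave-accented forms.
VOWELS = set("aeiouàèìòù")


def _after_vowel_consonant(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is none."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def _rv_start(word):
    """Start index of RV (definition shared with the Spanish stemmer)."""
    if len(word) < 3:
        return len(word)
    if word[1] not in VOWELS:
        # Second letter is a consonant: RV starts after the next vowel.
        for i in range(2, len(word)):
            if word[i] in VOWELS:
                return i + 1
        return len(word)
    if word[0] in VOWELS:
        # First two letters are vowels: RV starts after the next consonant.
        for i in range(2, len(word)):
            if word[i] not in VOWELS:
                return i + 1
        return len(word)
    # Consonant followed by vowel: RV starts after the third letter.
    return 3


def regions(word):
    """Return the strings (R1, R2, RV) for `word`."""
    r1 = _after_vowel_consonant(word)
    r2 = _after_vowel_consonant(word, r1)  # same rule, applied inside R1
    return word[r1:], word[r2:], word[_rv_start(word):]


print(regions("giocatrici"))  # ('atrici', 'rici', 'catrici')
```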

According to step 1, the suffix 'ici' or 'ice' is supposed to be deleted if it is in R2. That is clearly the case here, so 'giocatrici' should be stemmed to 'giocatr' and not to 'giocatric'.
Is this an error in your testing vocabulary or in the description of the Italian stemmer? Or have I missed something?
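
To make the reading of step 1 described above concrete, here is a minimal sketch of it. It encodes only the interpretation given in this message: it looks at 'ici' and 'ice' in isolation and does not consider whether a longer suffix from the step 1 list should take precedence, which may be exactly the point in question. The helper names are hypothetical, and preprocessing is omitted.

```python
VOWELS = set("aeiouàèìòù")


def _after_vowel_consonant(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is none."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r2_start(word):
    """Start index of R2: the R1 rule applied again inside R1."""
    return _after_vowel_consonant(word, _after_vowel_consonant(word))


def strip_ici_ice(word):
    """Delete a final 'ici' or 'ice' if the suffix lies entirely in R2.

    This is the interpretation described in this message, not
    necessarily the behaviour of the reference stemmer.
    """
    start = r2_start(word)
    for suffix in ("ici", "ice"):
        if word.endswith(suffix) and len(word) - len(suffix) >= start:
            return word[: -len(suffix)]
    return word


for w in ("giocatrici", "mediatrice", "pagatrice"):
    print(w, strip_ici_ice(w))
```

Run on the nine words listed above, this yields the stems in the second column.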


Thanks and best regards,
Peter




