[Snowball-discuss] Error in the vocabulary for Italian stemmer?
Peter Stahl
pemistahl at googlemail.com
Sun Jun 13 15:05:41 BST 2010
Hi everyone,
my name is Peter and I'm a student of computational linguistics from Bochum, Germany. I am porting your stemmers to pure Python, and I check my code by comparing its stemmed output against the testing vocabulary provided on your site. Most recently I compared the results of the Italian stemmer. My implementation produces the same results as yours, with the exception of nine words. They all end in either 'ici' or 'ice'. In my opinion, the stemmed forms of those words in the corresponding diffs.txt file are wrong according to the steps of the algorithm as described. The words in question are:
giocatrici    giocatr    (giocatric)
mediatrice    mediatr    (mediatric)
pagatrice     pagatr     (pagatric)
portatrice    portatr    (portatric)
portatrici    portatr    (portatric)
ricreatrice   ricreatr   (ricreatric)
roccatrici    roccatr    (roccatric)
salvatrice    salvatr    (salvatric)
sfruttatrici  sfruttatr  (sfruttatric)
The first column shows the unstemmed form, the second shows the stem produced by my implementation, and the third (in parentheses) shows the stem given in your testing vocabulary.
Let us take the word 'giocatrici' as an example:
R1 = 'atrici'
R2 = 'rici'
RV = 'catrici'
According to step 1, the suffix 'ici' (or 'ice') is supposed to be deleted if it lies in R2. That is the case here, since R2 = 'rici' ends in 'ici'. So by my reading, the word 'giocatrici' should be stemmed to 'giocatr' and not to 'giocatric'.
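For anyone who wants to check the regions for 'giocatrici', here is a minimal sketch in Python of how R1, R2 and RV can be computed. It is only an illustration under stated assumptions, not the code of my port or of the reference implementation: the function names `r1_r2` and `rv` are made up for this example, and the sketch skips the prelude that marks certain 'i' and 'u' as consonants, which as far as I can tell does not affect the nine words above.

```python
# Vowels of the Italian stemmer (including the grave-accented forms).
VOWELS = set("aeiouàèìòù")


def _after_first_vowel_consonant(s):
    """Return the part of s after the first non-vowel that follows a vowel."""
    for i in range(1, len(s)):
        if s[i] not in VOWELS and s[i - 1] in VOWELS:
            return s[i + 1:]
    return ""  # no such position: the region is empty


def r1_r2(word):
    """Hypothetical helper: R1, and R2 computed the same way inside R1."""
    r1 = _after_first_vowel_consonant(word)
    r2 = _after_first_vowel_consonant(r1)
    return r1, r2


def rv(word):
    """Hypothetical helper: the RV region (prelude omitted)."""
    if len(word) < 3:
        return ""
    if word[1] not in VOWELS:
        # Second letter is a consonant: RV starts after the next vowel.
        for i in range(2, len(word)):
            if word[i] in VOWELS:
                return word[i + 1:]
        return ""
    if word[0] in VOWELS:
        # First two letters are vowels: RV starts after the next consonant.
        for i in range(2, len(word)):
            if word[i] not in VOWELS:
                return word[i + 1:]
        return ""
    # Consonant-vowel case: RV is the region after the third letter.
    return word[3:]


if __name__ == "__main__":
    r1, r2 = r1_r2("giocatrici")
    print("R1 =", r1)
    print("R2 =", r2)
    print("RV =", rv("giocatrici"))
    print("'ici' lies in R2:", r2.endswith("ici"))
```

Run on 'giocatrici', this reproduces the three regions listed above and confirms that 'ici' lies entirely within R2. It deliberately does not implement step 1 itself, since how the suffix is selected there is exactly what my question is about.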
Is this an error in your testing vocabulary or in the description of the Italian stemmer? Or did I miss anything?
Thanks and best regards,
Peter