[Snowball-discuss] Error in the vocabulary for Italian stemmer?
Martin Porter
martin at porterloo.wanadoo.co.uk
Mon Jun 14 12:02:53 BST 2010
Peter,
The point is that the search is for the longest suffix, and when no action
is taken for that suffix, the algorithm doesn't then move to a shorter
suffix for which action can be taken.
So giocatrici (=comedienne, I assume) ends -ici as well as -atrici. The
longer, -atrici, is taken, and there is no action (because it is not in R2).
The -ici ending is not considered, even though it is in R2.
In other words you search for the longest suffix, and see if it is removable;
you don't search for the longest suffix which is removable.
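The difference between the two readings can be sketched in Python. This is only an illustration, not the Snowball source: the function names are made up for this example, the suffix list is a small subset of the step 1 endings, and only the step 1 deletion is modelled (the later steps of the stemmer are left out).

```python
VOWELS = set("aeiouàèìòù")

def region_after_first_vc(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    searching from `start`; len(word) if there is none."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def r2_start(word):
    """R2 is the R1 rule applied again inside R1."""
    r1 = region_after_first_vc(word)
    return region_after_first_vc(word, r1)

# Illustrative subset of the step 1 suffixes, longest first.
SUFFIXES = sorted(("atrici", "atrice", "ici", "ice"), key=len, reverse=True)

def strip_longest_then_check(word):
    """Find the longest matching suffix, then test R2.
    If the test fails, no shorter suffix is tried."""
    r2 = r2_start(word)
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            if len(word) - len(suffix) >= r2:
                return word[: -len(suffix)]
            return word  # longest suffix found, but not in R2: no action
    return word

def strip_longest_removable(word):
    """The other reading: skip suffixes that are not in R2
    and fall through to a shorter one that is."""
    r2 = r2_start(word)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= r2:
            return word[: -len(suffix)]
    return word

print(strip_longest_then_check("giocatrici"))
print(strip_longest_removable("giocatrici"))
```

For giocatrici the first function leaves the word alone in step 1, because -atrici starts before R2, while the second removes -ici and gives giocatr, which is the result Peter's port produced.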
I guess the same point lies behind your second email,
Martin
At 04:05 PM 6/13/2010 +0200, Peter Stahl wrote:
>
>Hi everyone,
>
>my name is Peter and I'm a student of computational linguistics from
>Bochum, Germany. During my attempt to port your stemmers to pure Python, I
>always compare the stemmed results of my code to those in the testing
>vocabulary provided on your site. At last, I compared the results of the
>Italian stemmer. My implementation produces the same results as yours, but
>with the exception of nine words. They all end with either 'ici' or 'ice'.
>In my opinion, the stemmed forms of those words in the appropriate diffs.txt
>file are wrong according to the mentioned steps of the algorithm. It is all
>about the following words:
>
>giocatrici giocatr (giocatric)
>mediatrice mediatr (mediatric)
>pagatrice pagatr (pagatric)
>portatrice portatr (portatric)
>portatrici portatr (portatric)
>ricreatrice ricreatr (ricreatric)
>roccatrici roccatr (roccatric)
>salvatrice salvatr (salvatric)
>sfruttatrici sfruttatr (sfruttatric)
>
>The first column shows the unstemmed forms, the second shows the stemmed
>forms of my implementation, the third shows the stemmed forms of your
>testing vocabulary.
>Let us take the word 'giocatrici' as an example:
>
>R1 = 'atrici'
>R2 = 'rici'
>RV = 'catrici'
>
>According to step 1, the suffixes 'ici' or 'ice', respectively, are
>supposed to be deleted if one of these is in R2. This is definitely the case
>here. So the word 'giocatrici' is stemmed to 'giocatr' and not to 'giocatric'.
>Is this an error in your testing vocabulary or in the description of the
>Italian stemmer? Or did I miss anything?
>
>
>Thanks and best regards,
>Peter