[Snowball-discuss] Error in the vocabulary for Italian stemmer?

Peter Stahl pemistahl at googlemail.com
Wed Jun 16 16:02:27 BST 2010


Hi again,

Now I have a different problem, this time with the Romanian stemmer, specifically step 4 in the description. There are many short words to which only step 4 can apply, yet according to the results in your test vocabulary they are not stemmed at all. I don't see why. Could you please explain this?

Let's take the word 'ambe', for example. R1 is 'be'; R2 and RV are empty strings. Only step 4 can apply here. The longest step 4 suffix that matches this word is the vowel 'e', and it lies in R1. Yet your results say that the word should not be stemmed to 'amb'. Why?
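
For reference, here is a minimal Python sketch of how the regions for 'ambe' can be computed. It assumes the usual Snowball definitions of R1 and R2, and the RV rule shared by the Spanish and Romanian descriptions as I understand them. The function names are hypothetical, and the sketch skips the stemmer's preliminary marking of 'i' and 'u' between vowels, which does not affect this word.

```python
# Romanian vowels as listed in the stemmer description (assumed here).
VOWELS = set("aeiouăâî")


def after_vowel_consonant(word, start=0):
    """Return the index just past the first non-vowel that follows a vowel,
    searching from `start`; return len(word) if there is no such position."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def rv_start(word):
    """Return the index where RV begins (assumed Spanish/Romanian-style rule)."""
    n = len(word)
    if n < 2:
        return n
    if word[1] not in VOWELS:
        # Second letter is a consonant: RV starts after the next vowel.
        for i in range(2, n):
            if word[i] in VOWELS:
                return i + 1
        return n
    if word[0] in VOWELS:
        # First two letters are vowels: RV starts after the next consonant.
        for i in range(2, n):
            if word[i] not in VOWELS:
                return i + 1
        return n
    # Consonant-vowel start: RV starts after the third letter.
    return min(3, n)


def regions(word):
    """Return (R1, R2, RV) as strings."""
    r1 = after_vowel_consonant(word)
    r2 = after_vowel_consonant(word, r1)
    return word[r1:], word[r2:], word[rv_start(word):]


print(regions("ambe"))
```

Under these assumptions the sketch gives the same regions as above: R1 is 'be', and R2 and RV are empty. One consequence is that any step which requires its suffix to lie in RV, rather than R1, would leave 'ambe' unchanged.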

Thank you very much. (I hope I'm not getting on your nerves. But I really don't know how to solve this.)


Best regards,
Peter 





On 16.06.2010 at 11:09, Martin Porter wrote:

> 
>> For people who want to do it the same way it would be good if you could
>> make it a bit clearer in the descriptions that one should not search for the
>> longest suffix that can be deleted, as this might be a source for
>> misunderstandings.
> 
> I think the descriptions, if carefully read, are clear on this point, but
> the important lesson here is that the snowball system does achieve what the
> original Porter stemmer description did not, namely, it results in exact
> definitions of the algorithms, since errors in recoding are detectable and
> correctable. Incidentally, that particular error (searching for the longest
> suffix that can be deleted rather than the longest suffix, and then seeing
> if it is deletable) was built into an early encoding of the Porter stemmer
> which was standardly used for many years, and lies behind the note in the
> description of Snowball at http://snowball.tartarus.org/texts/introduction.html,
> 
> "A good test is to type in agreement. It should stem to agreement — the same
> word. If it stems to agreem there is an error."
> 
> 
> 
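
The difference between the two search strategies mentioned in the quoted reply can be sketched in Python as follows. This is only an illustration, not the Porter stemmer itself: it uses three suffixes from step 4 of the original Porter algorithm (ement, ment, ent) with the condition m > 1, and the function names are hypothetical.

```python
# Illustrative subset of Porter step 4 suffixes, longest first.
SUFFIXES = ["ement", "ment", "ent"]


def is_consonant(word, i):
    """Porter's definition: 'y' counts as a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def strip_longest_suffix(word):
    """Intended reading: find the longest matching suffix, then test it once."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if measure(stem) > 1 else word
    return word


def strip_longest_deletable(word):
    """Erroneous reading: keep trying shorter suffixes until one is deletable."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 1:
                return stem
    return word


print(strip_longest_suffix("agreement"))     # agreement
print(strip_longest_deletable("agreement"))  # agreem
```

In the first function the match on 'ement' ends the search even though the condition fails, so 'agreement' is left alone. In the second, the search falls through to 'ent', whose stem 'agreem' satisfies m > 1, which produces the 'agreem' result the note warns about.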



