[Snowball-discuss] Re: R: italian stemming

Martin Porter martin_porter@softhome.net
Thu Sep 5 09:04:01 2002


At 01:51 PM 9/3/02 +0200, enea wrote:
>I encountered another problem with your stemmer.
>israele is stemmed to israel
>israeliano is stemmed to israel (ano is verb suffix)
>israeliani, israeliane and israeliana are stemmed to israelian
>I wonder if I should remove ano from verb suffix and add ano, ani, ana
>and ane as standard suffixes since this problem is frequent (eg.
>italiano, italiani; partigiano, partigiani; gabbiano, gabbiani; indiano,
>indiani; isolano, isolani; romano, romani...). What do you think?
>Regards,
>
>Enea
>

Enea,

Yes, -ano is problematical. It is interesting that the corresponding French
ending -ent (3rd person plural present indicative) is also problematical,
and is in fact not removed in the French stemmer. Of course -ent is slightly
broader than -ano, since it is the ending for all three classes of verb
conjugation.

One possibility is not to remove -ano at all. You might look at that.

Your idea of removing -ana, -ani, -ane often crops up in stemmer design, and
is quite sensible. As adjectival endings, you can think of them as

    -ano + -a
    -ano + -i
    -ano + -e

Finding -a, -i or -e here implies a noun or adjective form. That knowledge
is then discarded and -ano is removed as though it were a verb ending, so
the result matches words whose -ano has been removed as a verb ending when
it is in fact part of the stem. More generally, in an ending -A + -B, B may
tell us that A is not a true ending, but we choose to discard that
information. The Porter stemmer is in fact a bit like that: no state
information is preserved from one ending-removal step to the next.
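
To make the -A + -B point concrete, here is a minimal sketch in Python. It is
a toy, not the Snowball Italian algorithm: the function name toy_stem, the
keep_state flag and the two-step layout are all assumptions made purely for
illustration. With keep_state=False the fact that B was a nominal ending is
thrown away before A is considered; with keep_state=True it is remembered
and A is left alone.

```python
def toy_stem(word: str, keep_state: bool = False) -> str:
    """Toy two-step ending removal (hypothetical, not the Snowball stemmer)."""
    saw_nominal_ending = False

    # Step B: in -ana, -ani, -ane, strip the final vowel and note that it
    # marked a noun or adjective form.
    for b in ("a", "i", "e"):
        if word.endswith("an" + b):
            word = word[:-1]
            saw_nominal_ending = True
            break

    # Step A: remove what is left of -ano.
    if saw_nominal_ending:
        # A stateless design ignores what step B learned and strips "an";
        # a stateful design keeps "an" as part of the stem.
        return word if keep_state else word[:-2]
    if word.endswith("ano"):
        # -ano with no nominal vowel after it: removed as a verb ending.
        return word[:-3]
    return word


if __name__ == "__main__":
    for w in ("israeliano", "israeliana", "israeliani", "israele"):
        print(w, toy_stem(w), toy_stem(w, keep_state=True))
```

In this toy the stateless variant sends israeliano, israeliana and israeliani
to the same stem, at the price of treating every such -an- as removable; the
stateful variant leaves israelian untouched once it has seen the nominal
vowel.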

I cannot say whether -ana, -ani, -ane could usefully be added to the stemmer
as endings; you will need to experiment (create a file like
italian/diffs.txt with the consequences of the stemmer change visible in a
third column, and inspect it). I may have tried this myself at one time, but
cannot remember. The danger is of course overstemming: -ana/-ane endings
will be removed from feminine nouns that have no corresponding -ano form.
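
The experiment could be set up roughly as follows. This is only a sketch
under stated assumptions: old_stem and new_stem are hypothetical stand-ins
for the current and the modified stemmer (they look at nothing but the four
endings under discussion), and the three-column layout of word, current
stem, modified stem is my guess at a convenient format rather than the exact
format of italian/diffs.txt.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, str, str]  # (word, current stem, modified stem)


def old_stem(word: str) -> str:
    # Hypothetical stand-in for the current behaviour: only -ano is removed.
    return word[:-3] if word.endswith("ano") else word


def new_stem(word: str) -> str:
    # Hypothetical stand-in for the proposed change: -ana, -ani, -ane too.
    for ending in ("ano", "ana", "ani", "ane"):
        if word.endswith(ending):
            return word[:-3]
    return word


def diff_rows(words: Iterable[str],
              before: Callable[[str], str],
              after: Callable[[str], str]) -> List[Row]:
    """One row per word, with the modified stem in the third column."""
    return [(w, before(w), after(w)) for w in words]


def format_rows(rows: Iterable[Row]) -> str:
    """Lay the rows out in fixed-width columns for inspection by eye."""
    return "\n".join(f"{w:<16}{a:<16}{b}" for w, a, b in rows)


if __name__ == "__main__":
    sample = ["israeliano", "israeliani", "romana", "settimana"]
    print(format_rows(diff_rows(sample, old_stem, new_stem)))
```

Rows where the second and third columns differ are the ones to inspect.
settimana is in the sample as a feminine noun with no -ano form, the kind of
word that would be overstemmed by the change.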

My preferred advice, however, is to put the stemmer into service in its
present form and see what reactions you get.

Martin