[Snowball-discuss] C# Italian Stemmer
Martin Porter
martin at porterloo.wanadoo.co.uk
Fri Aug 29 09:33:08 BST 2008
Luca,
Many thanks for the contribution, which I've put in place at
http://snowball.tartarus.org/otherlangs
(Your attached program begins with hex characters EF-BB-BF before the '/*'.
Is that okay?)
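Those three bytes match the UTF-8 byte order mark, which some editors
prepend when saving a file. A minimal Python sketch of how one might detect
and strip it, assuming the source is read as raw bytes (the function name
strip_utf8_bom is made up for illustration):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):  # BOM_UTF8 is b"\xef\xbb\xbf"
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file content: a BOM followed by the start of a C-style comment.
source = b"\xef\xbb\xbf/* stemmer source */"
cleaned = strip_utf8_bom(source)
print(cleaned[:2])
```

Decoding with the "utf-8-sig" codec has the same effect when text rather
than bytes is wanted.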
I'm trying to think of comments you would appreciate ... but with your
superior knowledge of Italian I think we are more likely to benefit from
hearing yours!
I think the identification of stop words should be separate from the
stemming algorithm. What significance is attached to stopwords will depend
on the IR model you have in mind, and nowadays (when all words get indexed,
and common words get low weightings) there is a case for not having the
concept 'stopword' as part of an IR system at all.
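As a rough sketch of that separation, in Python: the stemmer below knows
nothing about stop words, and the filtering is an optional step supplied by
the caller. Everything here is an assumption made for illustration: the tiny
stop list, the function names, and toy_stem, which is only a placeholder and
not the Snowball Italian algorithm.

```python
from typing import Callable, Iterable, List, Optional, Set

# Illustrative stop list only; a real list would depend on the IR model.
ITALIAN_STOPWORDS: Set[str] = {"il", "la", "di", "e", "che"}


def toy_stem(word: str) -> str:
    """Placeholder stemmer: drop one final vowel from words longer than 3 letters."""
    if len(word) > 3 and word[-1] in "aeio":
        return word[:-1]
    return word


def index_terms(
    tokens: Iterable[str],
    stem: Callable[[str], str],
    stopwords: Optional[Set[str]] = None,
) -> List[str]:
    """Lowercase each token, optionally drop stop words, then stem what remains."""
    terms = []
    for token in tokens:
        word = token.lower()
        if stopwords is not None and word in stopwords:
            continue  # the filtering decision lives here, outside the stemmer
        terms.append(stem(word))
    return terms


tokens = ["la", "banca", "di", "Roma"]
print(index_terms(tokens, toy_stem, ITALIAN_STOPWORDS))  # stop words removed
print(index_terms(tokens, toy_stem))                     # every word indexed
```

With this arrangement, a system that indexes every word and relies on low
weightings for common ones simply leaves the stopwords argument out.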
Regarding irregular verbs: I did once develop a scheme for reducing
irregular Italian verbs to a standard form. It turned into a very big lump
of software. As well as the very large number of basic irregular forms there
are the numerous compounds (interrompere, irrompere like rompere etc). And
finally I had serious doubts about its usefulness. Many of the irregular
forms (as you know) are a change of stem in the p.p. (past participle), and
the use of that stem in the past definite. These irregular p.p. forms often
acquire separate meanings that make conflation with the infinitive
undesirable -- correre, corso for example.
I think the main problem with the Italian stemmer is that there is no way of
distinguishing unimportant masculine/feminine variations like bianco/bianca,
from word pairs where the masculine/feminine forms have different meanings,
like banco/banca (bench/bank).
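The effect can be seen with a deliberately crude suffix rule. The Python
sketch below is an assumption for illustration (strip_final_vowel is a toy,
not the actual Italian stemmer): it drops a final vowel and then checks
whether each masculine/feminine pair ends up with the same stem.

```python
def strip_final_vowel(word: str) -> str:
    """Toy rule: remove one trailing a/e/i/o from words longer than 3 letters."""
    if len(word) > 3 and word[-1] in "aeio":
        return word[:-1]
    return word


pairs = [
    ("bianco", "bianca"),  # same meaning, conflation is harmless
    ("banco", "banca"),    # bench vs. bank, conflation loses a distinction
]

for masculine, feminine in pairs:
    same_stem = strip_final_vowel(masculine) == strip_final_vowel(feminine)
    print(masculine, feminine, same_stem)
```

Both pairs are conflated, because a purely suffix-based rule has no
information about meaning with which to treat them differently.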
Martin
At 19:56 21/08/2008 +0200, luca wrote:
>Hello, I'm an Italian programmer with a BA in Languages and I'm interested in
>computational linguistics. Looking for stemming algorithms on the internet, I
>stumbled upon your site and found your work awesome; I tried to code a
>stemmer based upon your Italian Stemming Algorithm. I haven't published it
>on the Net yet, but I am sending you the source code so that you can put it on
>your site, if you like.
>Any comments would be greatly appreciated, as well as some hints on how
>you would manage stop words and irregular verbs (such as the verbs to have,
>to be and so on).
>
>
>
>Luca