[Snowball-discuss] Finnish stemmer: some suggestions and some doubts
Vili Lehdonvirta
vili.lehdonvirta@hut.fi
Sat Nov 29 12:11:01 2003
Hi all,
First of all, let me say that the Finnish stemmer is an impressive piece
of work from someone who presumably does not speak the language. However,
a quick glance at the sample vocabulary immediately reveals instances of
what seems to me like understemming. I spent some time looking at this and
here's what I found (if you're not interested in reading about minor
improvements to the algorithm, please skip to my questions, which are of
a more general nature).
In one class of instances I think the understemming is due to a shortcut
taken by the algorithm. In Finnish, some possessive suffixes (nsa nsä mme
nne) may absorb the genitive case suffix n. For example:
edeltäjä      predecessor
edeltäjän     predecessor's
edeltäjänsä   his predecessor, his predecessor's (polyseme)
The algorithm stems these by first removing the possessive suffix (step
1), if any, and then the genitive case suffix (step 3), if any. Finally,
the trailing ä is removed. For all of the above words the resulting stem
is edeltäj, which seems fine.
However, if the genitive suffix is added to a plural, the plural is
manifested in various different ways before the suffix. For example:
edeltäjät     predecessors
edeltäjien    predecessors'
The algorithm accounts for some plurals in step 6 (b-d), and for the
particular type in the example above in the last rule of step 3. Thus,
both words are stemmed to edeltäj.
So, finally, here comes the problem:
edeltäjiensä  his predecessors'
For the above word, step 1 correctly recognizes the possessive suffix and
proceeds to delete it. However, the remaining word edeltäjie does not
trigger the genitive suffix rule in step 3. The suffix n has been removed,
but the plural identifier ie remains. Thus, the word is stemmed to
edeltäjie, not edeltäj.
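
The behaviour described above can be reproduced with a toy model. This is
a minimal sketch in Python, not the actual Snowball code: the function
names are made up for illustration, the rules are reduced to the few
suffixes mentioned in this message, and step 6 (so the plain plural
edeltäjät) is not modelled at all.

```python
# Toy model only: a drastically simplified stand-in for steps 1 and 3
# and the final vowel removal. All names here are hypothetical.
VOWELS = "aeiouyäö"
POSSESSIVES = ("nsa", "nsä", "mme", "nne")


def strip_possessive(word):
    """Toy 'step 1': delete a possessive suffix outright."""
    for suffix in POSSESSIVES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def strip_genitive(word):
    """Toy 'step 3': remove plural genitive 'ien', else 'n' after a vowel."""
    if word.endswith("ien"):
        return word[:-3]
    if len(word) > 1 and word.endswith("n") and word[-2] in VOWELS:
        return word[:-1]
    return word


def strip_final_vowel(word):
    """Toy tidy-up: drop a trailing a/ä."""
    return word[:-1] if word.endswith(("a", "ä")) else word


def toy_stem(word):
    return strip_final_vowel(strip_genitive(strip_possessive(word)))


for w in ("edeltäjä", "edeltäjän", "edeltäjänsä", "edeltäjien", "edeltäjiensä"):
    print(w, "->", toy_stem(w))
```

In this model the first four words come out as edeltäj, while edeltäjiensä
comes out as edeltäjie, because once nsä is gone nothing is left to
trigger the ien rule.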
I think this could be fixed by modifying step 1 so that (nsa nsä mme nne)
would not be deleted, but changed into n. n is then later removed in step
3, along with ie, if present. I can't think of any side effects, though I
have not run any tests with the vocabulary.
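
A minimal sketch of this proposal, again as a toy Python model rather than
the real Snowball rules: the function names are hypothetical, only the
suffixes discussed here are handled, and it says nothing about side
effects on the full vocabulary.

```python
# Toy model only: the possessive suffix is rewritten to "n" instead of
# being deleted, so the later genitive rule can still fire.
VOWELS = "aeiouyäö"
POSSESSIVES = ("nsa", "nsä", "mme", "nne")


def possessive_to_n(word):
    """Proposed 'step 1': replace a possessive suffix with a bare 'n'."""
    for suffix in POSSESSIVES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + "n"
    return word


def strip_genitive(word):
    """Toy 'step 3': remove plural genitive 'ien', else 'n' after a vowel."""
    if word.endswith("ien"):
        return word[:-3]
    if len(word) > 1 and word.endswith("n") and word[-2] in VOWELS:
        return word[:-1]
    return word


def strip_final_vowel(word):
    """Toy tidy-up: drop a trailing a/ä."""
    return word[:-1] if word.endswith(("a", "ä")) else word


def toy_stem_proposed(word):
    return strip_final_vowel(strip_genitive(possessive_to_n(word)))


for w in ("edeltäjänsä", "edeltäjien", "edeltäjiensä"):
    print(w, "->", toy_stem_proposed(w))
```

Here edeltäjiensä first becomes edeltäjien and then edeltäj, and the
singular edeltäjänsä still ends up as edeltäj.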
Another type of understemming, of which there seems to be a lot in the
sample vocabulary, is due to the possessive suffixes iaan and iään not
being recognized by the algorithm at all. However, this is not so
straightforward, as those endings may also indicate something else,
particularly for imported words like akatemia, Austraalia. This would
need more looking into before any fixes can be suggested.
Now to the doubts part. I was looking for something meaningful to do for
the Nutch project, which led me to Lucene, which led me to wonder if there
are good algorithms for normalizing Finnish, which led me here. Is this
algorithm being used in any applications? Would it be worth spending
some time on it? Is the algorithmic stemming approach the best choice for
normalizing Finnish in projects like Nutch, in your opinion? How
about "morphological analysis" [1]?
Finally, I'm beginning to wonder whether I should just leave stemming
and normalization to IR experts and linguists. I've done basic university
CS studies and have a few years of working experience, but that's as far
as it goes.
[1] http://www.linguistlist.org/issues/4/4-862.html
Cheers,
--
Vili Lehdonvirta
vili.lehdonvirta@hut.fi
+65 94367590