[Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?
Arjen van der Meijden
Mon Jan 5 15:54:01 2004
I have a few counter remarks, but you're mostly right, I think :)
Edwin de Jonge wrote:
> Just like you, I don't think that there is/will be a perfect stemming
> algorithm for Dutch (with the exception of the absurd Snowball program
> where for every word the stem is listed (the dictionary approach)).
That exception won't work either, because of the disambiguation problem you
mention. Anyway, it's something to just accept: if you tried to fix that,
you'd probably end up with a massive application with an enormous number
of rules, which would still make mistakes ;)
> Nice that you did a search! Not trying to be a wise guy, but the words
> you have found all are of foreign origin:
> mazzel(en) = Yiddish/Hebrew
> puzzle/puzzelen = English
> quizzen = English (plural of quiz)
> But you are right: if they are in "van Dale" then they are Dutch words
> by definition.
Not entirely: they are accepted into the Dutch language, and are therefore
included in "Van Dale", not the other way around :)
But then again, once they are in Van Dale, you can be pretty sure it's
Dutch (or accepted in the Dutch language), although a word that is not
in Van Dale might still be a real Dutch word.
> Our 'strong verbs' are indeed a real pain in the butt for Snowball.
> Luckily, in modern Dutch more
> and more "strong" verbs are turning into "weak" verbs (but this is a
> slow process; for example, before 1930 the
> past tense of "wassen" (= wash) was "wies" instead of "waste(n)").
I didn't know that :)
> True, but as said before snowball doesn't do disambiguation.
> But it still is desirable that "manen" (in its different senses)
> maps to "maan" (in the same different senses).
Yeah, your approach has the advantage of producing more distinct stems.
Although mannen and manen can each have a few meanings, it is indeed a win
if manen and maan stay distinct from mannen and man.
Even if that means that both the horse's and Saturn's manen get stemmed
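To make the mannen/manen distinction concrete, here is a rough Python
sketch. It is not Snowball and not the actual rules of the Dutch stemmer;
the function names (undouble, lengthen, toy_stem) are made up for
illustration, and it only handles the "-en" ending. It assumes that a
doubled consonant before "-en" marks a short vowel (so it is undoubled),
while a single consonant after a single vowel marks a long vowel (so the
vowel is doubled in the stem).

```python
VOWELS = set("aeiou")


def undouble(stem: str) -> str:
    """Collapse a doubled final consonant: 'mann' -> 'man'."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return stem[:-1]
    return stem


def lengthen(stem: str) -> str:
    """Double the vowel in a final consonant-vowel-consonant pattern.

    'man' -> 'maan'. The vowel 'i' is left alone, since its long form
    is not written 'ii' in Dutch.
    """
    if (len(stem) >= 3
            and stem[-1] not in VOWELS
            and stem[-2] in "aeou"
            and stem[-3] not in VOWELS):
        return stem[:-2] + stem[-2] * 2 + stem[-1]
    return stem


def toy_stem(word: str) -> str:
    """Toy rule for the '-en' ending only (an assumption of this sketch)."""
    if not word.endswith("en"):
        return word
    stem = word[:-2]
    undoubled = undouble(stem)
    if undoubled != stem:
        return undoubled        # mannen -> mann -> man
    return lengthen(stem)       # manen -> man -> maan


for w in ("mannen", "man", "manen", "maan"):
    print(w, "->", toy_stem(w))
```

With this, mannen and man end up together, and manen and maan end up
together, without the two groups clashing. It will certainly get other
words wrong (unstressed syllables, for instance), so take it as an
illustration of the idea only.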
>>Make sure you don't strip 'ig' if it was 'tig', like dertig, gretig,
>>Actually, perhaps you shouldn't strip 'ig' at all, bazig is
>>different than baas. And most, if not all, -ig versions of nouns and
>>verbs have a (sometimes only slightly) different, derived meaning.
> For this one I'm neutral. I think "ig" suffix is (used) the same as the
> suffix in English. (e.g. boss, bossy, wet, wetty).
> You are right about the shift in meaning, but I'm not sure if it is
I don't really know whether the shift in meaning is very bad or whether
it'll result in weird clashes.
But the change is larger than the change you get from simply stemming a
plural to its singular form and stemming verbs to their stem.
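For what it's worth, the 'tig' guard itself is simple to express. A small
Python sketch, with a made-up function name (strip_ig) and a made-up
guard_tig switch, and not taken from the actual Snowball rules:

```python
def strip_ig(word: str, guard_tig: bool = True) -> str:
    """Strip a trailing 'ig', optionally leaving '-tig' words untouched."""
    if not word.endswith("ig"):
        return word
    if guard_tig and word.endswith("tig"):
        return word             # dertig, gretig stay as they are
    return word[:-2]            # bazig -> baz


for w in ("dertig", "gretig", "bazig"):
    print(w, "->", strip_ig(w))
```

Note that this only removes the suffix: bazig becomes "baz", and mapping
that onto baas would need further rules that this sketch does not attempt.
Whether you want that conflation at all is exactly the open question.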
> Same issue here I think: disambiguation. The change proposed stems these
> words to the correct stem (only they are still ambiguous).
Yep, it really appears as if we just loved making our language as
ambiguous as possible.