[Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?

Arjen van der Meijden arjen@glas.its.tudelft.nl
Mon Jan 5 15:54:01 2004


Edwin,

I have a few counter-remarks, but I think you're mostly right :)

Edwin de Jonge wrote:

> Just like you I don't think that there is/will be a perfect stemming
> algorithm for Dutch (with the exception of the absurd Snowball program
> where for every word the stem is listed, i.e. the dictionary approach).

That exception won't work either, because of the disambiguation you
mention. Anyway, it's something to just accept: if you tried to fix that,
you'd probably end up with a massive application with an enormous number
of rules that still makes mistakes ;)

> Nice that you did a search! Not trying to be a wise guy, but the words
> you have found are all of foreign origin:
> 	mazzel(en) = Yiddish/Hebrew
> 	puzzle/puzzelen = English
> 	quizzen = English (plural of quiz)
> But you are right: if they are in "van Dale" then they are Dutch words
> by definition.

Not entirely: they are accepted into the Dutch language and therefore
included in "Van Dale", not the other way around :)
But then again, once they are in Van Dale, you can be pretty sure they're
Dutch (or accepted into the Dutch language), although a word that is not
in Van Dale might still be a real Dutch word.

> Our 'strong verbs' are indeed a real pain in the butt for Snowball.
> Luckily, in modern Dutch more and more "strong" verbs are turning into
> "weak" verbs (but this is a slow process; for example, before 1930 the
> past tense of "wassen" (= wash) was "wies" instead of "waste(n)").

I didn't know that :)

> True, but as said before snowball doesn't do disambiguation. 
> But it still is desirable that "manen" (in its different senses) 
> maps to "maan" (in the same different senses).

Yeah, your approach has the advantage of producing more distinct stems.
While mannen and manen can each have a few meanings, it is indeed a win if
maan and manen stay distinct from mannen and man.

Even if that means that both the horse's and Saturn's manen get stemmed
to maan.
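
To make the mannen/manen distinction concrete, here is a minimal sketch of
the idea, not the actual Snowball Dutch algorithm. The helper name
strip_en and its exact rules are assumptions made for illustration: after
removing -en, a doubled final consonant is undoubled, and a single vowel
left in a newly closed syllable is doubled.

```python
VOWELS = set("aeiou")
# Assumption: a long "i" is spelled "ie" in Dutch, so "i" is never doubled.
DOUBLABLE = set("aeou")


def strip_en(word):
    """Hypothetical helper: remove a final -en and repair the stem's spelling."""
    if not word.endswith("en") or len(word) < 4:
        return word
    stem = word[:-2]
    # Doubled final consonant: "mannen" -> "mann" -> "man"
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return stem[:-1]
    # Single vowel between consonants: "manen" -> "man" -> "maan"
    if (len(stem) >= 2 and stem[-1] not in VOWELS and stem[-2] in DOUBLABLE
            and (len(stem) == 2 or stem[-3] not in VOWELS)):
        return stem[:-2] + stem[-2] * 2 + stem[-1]
    return stem


for w in ("mannen", "manen", "boeken"):
    print(w, "->", strip_en(w))
```

With these rules mannen and manen end up at different stems (man and
maan). The vowel-doubling rule over-generates for words whose singular has
a short vowel, so this is only an illustration of the trade-off.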

>>Make sure you don't strip 'ig' if it was 'tig', like dertig, gretig,
>>nattig, etc.
>>Actually, perhaps you shouldn't strip 'ig' at all: bazig means something
>>different than baas. And most, if not all, -ig versions of nouns and
>>verbs have a (sometimes slightly) different, derived meaning.
> 
> For this one I'm neutral. I think the "ig" suffix is used the same way as
> the "y" suffix in English (e.g. boss, bossy; wet, wetty).
> You are right about the shift in meaning, but I'm not sure if it is
> enough.

I don't really know whether the shift in meaning is very bad or whether 
it'll result in weird clashes.
But the change is larger than the change you get from simply stemming a 
plural to its singular form and stemming verbs to their stem.
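
If 'ig' were stripped anyway, the 'tig' guard mentioned above could look
roughly like this. It is a sketch only: the function name strip_ig and the
minimum-length check are assumptions, not part of any existing stemmer.

```python
def strip_ig(word):
    """Hypothetical helper: strip a final -ig unless the word ends in -tig."""
    # Assumed minimum length so that very short words are left alone.
    if len(word) > 4 and word.endswith("ig") and not word.endswith("tig"):
        return word[:-2]
    return word


for w in ("dertig", "gretig", "nattig", "bazig"):
    print(w, "->", strip_ig(w))
```

The three -tig words pass through unchanged, while bazig becomes baz,
which would still need a further spelling repair to reach baas and
carries the shift in meaning discussed above.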

> Same issue here I think: disambiguation. The change proposed stems these
> words to the correct stem (only they are still ambiguous).

Yep, it really appears as if we just loved making our language as 
ambiguous as possible.

Best regards,

Arjen