[Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?

Edwin de Jonge ejne@rnd.vb.cbs.nl
Mon Jan 5 14:30:02 2004


Hi list,

Thanks, Arjen, for your reaction. I've put my remarks on your comments
below in the text.
Just like you, I don't think that there is/will be a perfect stemming
algorithm for Dutch (with the exception of the absurd snowball program
where for every word the stem is listed, i.e. the dictionary approach).
But I do think that a stemming algorithm can be quite good even for Dutch.
Furthermore, you are right that snowball doesn't take context (or any
other disambiguation) into account to deal with the problem of homonyms
or homographs: it just stems a word (and not a sentence or a text).

Arjen van der Meijden wrote:
> Edwin de Jonge wrote:
> > (except for "zz", "cc", "vv", "jj" because they are
> > non-existing, as far as I know).
> I've done a few searches in the "Van Dale Groot woordenboek hedendaags
> Nederlands" and there aren't many occurrences indeed.
> mazzel(en), puzzle/puzzelen, quizzen (plural of quiz).
Nice that you did a search! Not trying to be a wise guy, but the words
you have found are all of foreign origin:
	mazzel(en) = Yiddish/Hebrew
	puzzle/puzzelen = English
	quizzen = English (plural of quiz)
But you are right: if they are in "Van Dale" then they are Dutch words
by definition.

> > The doubles "tt" and "dd" in past tense...
> Not only in past tense:
> betten, bet, bette, gebet -> bet (to dab) (but beter
> shouldn't be stemmed to bet)
You are right. But my guess was that "tt" and "dd" are in the Dutch
stemming algorithm with the intention of stripping the past tense. By the
way, "beter" would be stemmed as "beeter" if doubling of long vowels were
implemented in the Dutch stemmer.

> Now I come to think of it, those context-aware-meanings might be even
> worse for a good stemmer, than the exceptions to the language
> rules :( And of course our 'strong verbs', like bijten, beet,
> gebeten (to bite)
> in contrast to beter (better) and betten.
Our 'strong verbs' are indeed a real pain in the butt for snowball.
Luckily, in modern Dutch more and more "strong" verbs are turning into
"weak" verbs (but this is a slow process; for example, before 1930 the
past tense of "wassen" (= wash) was "wies" instead of "waste(n)").

> > Singular -> plural
> > "maan" -> "manen" (moon, moons), long vowel.
> > "man" -> "mannen" (man, men), short vowel
> But of course manen also has different meanings, like 'to urge' and the
> long hair of a horse or lion (mane?). And I think mannen is sometimes
> used in replacement for bemannen (to man), although that meaning doesn't
> show up in Van Dale.
True, but as said before, snowball doesn't do disambiguation.
But it is still desirable that "manen" (in its different senses)
maps to "maan" (in the same different senses).

> > "boom" -> "bomen" (tree -> trees), long vowel
> > "bom" -> "bommen" (bomb -> bombs), short vowel.
> And both bomen and bommen are also verbs...
Same remark as above.
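
To make the singular/plural rule concrete, here is a rough sketch in
Python (not Snowball, and not the actual Dutch stemmer). The function
name strip_en, the length check and the simple vowel test are assumptions
made only for illustration:

```python
VOWELS = "aeiou"

def strip_en(word):
    """Strip a final "en" and repair the vowel length of what is left."""
    if not word.endswith("en") or len(word) < 5:
        return word
    stem = word[:-2]
    # "mannen" -> "mann" -> "man": a doubled consonant marks a short vowel.
    if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return stem[:-1]
    # "manen" -> "man" -> "maan": a single consonant after a single
    # a/e/o/u marks a long vowel, so the vowel is doubled again.
    # ("i" is left out because long "i" is not written "ii".)
    if (stem[-1] not in VOWELS and stem[-2] in "aeou"
            and stem[-3] not in VOWELS):
        return stem[:-2] + stem[-2] * 2 + stem[-1]
    return stem

for w in ("manen", "mannen", "bomen", "bommen"):
    print(w, "->", strip_en(w))
```

The minority mentioned in this thread goes wrong with this sketch:
"gaten" comes out as "gaat" instead of "gat".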

> > this is a tiny minority (to name a few: "gat" (hole), "god" (as in
> > English), "weg" (road), "vat" (barrel))
> I'm not sure how tiny that tiny is, I found an exception to all your
> examples above :/
I don't agree on this one: the exceptions you have found all have to do
with disambiguation, not with a short vowel turning into a long vowel
within the same stem.
I will try to quantify the tiny fraction by using the Dutch sample
dictionary of snowball.
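
One possible way to get a first count, purely as a sketch in Python: the
function name count_plural_patterns and the idea of feeding it a plain
list of words are assumptions, and nothing here depends on how the sample
dictionary is actually stored. It only counts candidates; ambiguous words
would still need a manual check.

```python
VOWELS = "aeiou"

def count_plural_patterns(words):
    """Return (regular, exceptions) for stems ending in a short vowel
    plus one consonant, e.g. "man" or "gat"."""
    vocab = set(words)
    regular = exceptions = 0
    for w in vocab:
        # Keep only: consonant + single a/e/o/u + consonant at the end.
        if not (len(w) >= 3 and w[-1] not in VOWELS
                and w[-2] in "aeou" and w[-3] not in VOWELS):
            continue
        if w + w[-1] + "en" in vocab:
            regular += 1      # "man" -> "mannen"
        elif w + "en" in vocab:
            exceptions += 1   # "gat" -> "gaten"
    return regular, exceptions

sample = ["man", "mannen", "manen", "maan", "gat", "gaten",
          "bom", "bommen", "bomen", "boom"]
print(count_plural_patterns(sample))
```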

> I do agree though, that your stemming routine seems to probably work out
> better than the current one, since that also mistakes all those
> exceptions, including yours.
Thanks :-)

> > 3) Turn word endings "z" into "s" and "v" into "f" after stripping
> > "en" or "ig"
> Make sure you don't strip 'ig' if it was 'tig', like dertig, gretig,
> nattig, etc.
> Actually, perhaps you shouldn't strip 'ig' at all, bazig means something
> different than baas. And most, if not all, -ig versions of nouns and
> verbs have a (sometimes slightly, derived) different meaning.
For this one I'm neutral. I think the "ig" suffix is used in the same
way as the "y" suffix in English (e.g. boss, bossy; wet, wetty).
You are right about the shift in meaning, but I'm not sure if it is
enough of a reason not to strip it.

> > Example:
> > "baas" -> "bazen" (boss, bosses)
> > "bazig" (bossy)
> > "leef" -> "leven" (to live)
> > "neef" -> "neven" (cousin/nephew)
> > "huis" -> "huizen" (house/houses)
> But of course, 'bazen' is also a verb (the stem 'baas' isn't too wrong
> though), 'huizen' is a verb (but again, the stem 'huis' isn't very
> wrong) and 'neven' means 'next to' in Flemish (doesn't it?)
Same issue here, I think: disambiguation. The proposed change stems these
words to the correct stem (only they are still ambiguous).
The 'neven' (Flemish) case (and maybe 'leven') is different, though: it
should be disambiguated before stemming...
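
For the examples in this part, a rough Python sketch of proposal 3
combined with the doubling of long vowels (again not Snowball and not the
actual stemmer; the names devoice, double_long_vowel and stem_en_ig and
the minimum-length check are only assumptions for illustration):

```python
VOWELS = "aeiou"

def devoice(stem):
    """Turn a final "z" into "s" and a final "v" into "f"."""
    if stem.endswith("z"):
        return stem[:-1] + "s"
    if stem.endswith("v"):
        return stem[:-1] + "f"
    return stem

def double_long_vowel(stem):
    """consonant + single a/e/o/u + consonant at the end: "baz" -> "baaz"."""
    if (len(stem) >= 3 and stem[-1] not in VOWELS
            and stem[-2] in "aeou" and stem[-3] not in VOWELS):
        return stem[:-2] + stem[-2] * 2 + stem[-1]
    return stem

def stem_en_ig(word):
    """Strip "en" or "ig", then repair vowel length and the final consonant."""
    for suffix in ("en", "ig"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return devoice(double_long_vowel(word[:-len(suffix)]))
    return word

for w in ("bazen", "bazig", "leven", "neven", "huizen"):
    print(w, "->", stem_en_ig(w))
```

This sketch has no guard for 'tig', so "dertig" comes out as "dert",
which is exactly the case Arjen warns about.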

> But I do hope the exceptions I think of are appreciated, I only try to
> help reach a better solution (and to achieve that, we need examples that
> stem correctly and not correct or even both at the same time) :)
I totally agree, thanks for your feedback.

Regards,

Edwin de Jonge