[Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?
Arjen van der Meijden
Mon Jan 5 13:02:01 2004
Edwin de Jonge wrote:
> Ok, what is the generalization: the undoubling can be done for all
> consonants (except for "zz", "cc", "vv", "jj" because they are
> non-existing, as far as I know).
I've done a few searches in the "Van Dale Groot woordenboek hedendaags
Nederlands" and there aren't many occurrences indeed:
mazzel(en), puzzle/puzzelen, quizzen (plural of quiz). The rest of the
zz's are mostly taken from Italian and English words (showbizz, pizza
> Explanation: The undouble routine is
> invoked after "en", "e" or "ene" is stripped from the word. This is done
> (I guess) so the stem is the same as the stem of a singular noun or
> verb. The doubles "tt" and "dd" seem special because they are used in
> past tense, but they can be treated the same as the other double
Not only in past tense:
betten, bet, bette, gebet -> bet (to dab) (but beter shouldn't be stemmed
zitten, zat, gezeten -> zit (to sit) (zat is also 'drunk')
vatten, vatte, heeft gevat -> vat (to catch/grasp) (but, vat -> vaten =
spitten -> spit, spitte, heeft gespit (to dig/pitch) (and of course
'spit' is also 'lumbago' (backache) and the English 'spit')
bidden -> bid, bad, gebeden (to pray/bid) (and 'bad' is also 'bath')
wedden -> wed, wedde, gewed (to bet)
But indeed, they can probably be treated the same :) (hitte (heat) vs
hit (hit) might be the only clash you get?)
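
A minimal Python sketch of the generalization discussed above: strip "ene",
"en" or "e", then undouble whatever consonant is left doubled at the end.
This is only an illustration and not the Snowball implementation; the
function names and the minimum-length guard are assumptions made for the
example.

```python
# Illustrative sketch only; this is not the Snowball Dutch stemmer.
VOWELS = set("aeiouy")


def undouble(stem: str) -> str:
    """Drop the last letter when the stem ends in a doubled consonant."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return stem[:-1]
    return stem


def strip_and_undouble(word: str) -> str:
    """Strip one of "ene", "en", "e" (longest first), then undouble."""
    for suffix in ("ene", "en", "e"):
        # Keep at least two letters (an arbitrary guard for this sketch).
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return undouble(word[: -len(suffix)])
    return word


for w in ("mannen", "zitten", "wedden", "bette", "hitte"):
    print(w, "->", strip_and_undouble(w))
```

With this sketch "hitte" ends up as "hit", which is exactly the hitte/hit
clash mentioned above.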
Now that I come to think of it, those context-dependent meanings might be
even worse for a good stemmer than the exceptions to the language rules :(
And of course our 'strong verbs', like bijten, beet, gebeten (to bite)
in contrast to beter (better) and betten.
> Words in Dutch never (I'm not aware of any example) end in a double
> consonant (unlike German), so stripping double consonants seems like a
> good idea (at least to me).
Me neither, except a few English words which made it into our
dictionaries like jazz, boss etc.
> If Dutch words don't end in double consonants, then what is their
> function inside words?
> Well, they mark a short vowel. The following vowels have a short and
> long version (i.e. all vowels but "i" and "y"):
> "a", "aa"
> "e", "ee"
> "u", "uu"
> "o", "oo"
> Long vowels (V) are normally written twice (VV), except when they are
> followed by a consonant (c) that is followed by a
> vowel (v) and they are the first character or preceded by a
> consonant (are you still with me?). In that case they are written as a
> single character. (=cVcv).
> To mark a difference between a short and long vowel in plurals (that may
> end with "en") consonants can be doubled.
> Singular -> plural
> "maan" -> "manen" (moon, moons), long vowel.
> "man" -> "mannen" (man, men), short vowel
But of course manen also has different meanings, like 'to urge' and the
long hair of a horse or lion (mane?). And I think mannen is sometimes
used in place of bemannen (to man), although that meaning doesn't
show up in Van Dale.
> "boom" -> "bomen" (tree -> trees), long vowel
> "bom" -> "bommen" (bomb -> bombs), short vowel.
And both bomen and bommen are also verbs...
> The Dutch snowball stemming algorithm currently stems long and short
> vowel versions the same (which isn't correct). This can be improved by
> doubling a vowel (=injecting a vowel), if it is a long vowel: this is
> only needed for the last long vowel in a word (because the rest will not
> affect the stemming)
> "manen", is now stemmed "man", but then would be stemmed as "maan"
> "bomen", is now stemmed "bom", but then would be stemmed as "boom"
Well, not if you're talking about 'de manen van een paard' (the manes of
a horse) in contrast to 'de manen van Saturnus' (the moons of Saturn).
And 'bomen door het moeras' (pole through the swamp) vs 'bomen in het
bos' (trees in the forest) (even worse 'bomen in het moeras', does that
mean the trees or 'to pole' ?).
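
To make the quoted proposal concrete, here is a rough Python sketch of the
vowel-doubling idea for "-en" plurals: a doubled consonant marks a short
vowel and is undoubled, while a single a/e/o/u in a cVc ending is taken to
be long and written twice. The function name and the exact conditions are
assumptions for illustration; Edwin's Snowball version may well differ.

```python
# Illustrative sketch only; not Edwin's routine and not Snowball code.
VOWELS = set("aeiouy")
LONG_SHORT = set("aeou")  # vowels that have a short and a long spelling


def stem_en_plural(word: str) -> str:
    """Strip "en" and restore the singular spelling of the last vowel."""
    if not word.endswith("en") or len(word) < 5:
        return word
    stem = word[:-2]
    last, prev = stem[-1], stem[-2]
    if last in VOWELS:
        return stem
    if last == prev:
        # Doubled consonant: the vowel before it is short ("mannen" -> "man").
        return stem[:-1]
    before = stem[-3] if len(stem) >= 3 else ""
    if prev in LONG_SHORT and before not in VOWELS:
        # Single vowel in an open syllable: long, so write it twice.
        return stem[:-2] + prev * 2 + last
    return stem


for w in ("manen", "mannen", "bomen", "bommen", "vaten"):
    print(w, "->", stem_en_plural(w))
```

The sketch turns "vaten" into "vaat" rather than "vat", so the exceptions
where a short vowel becomes long in the plural are still stemmed wrongly,
and it cannot tell the different meanings of "manen" or "bomen" apart.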
> I will try to put this routine into snowball syntax (which will
> hopefully be clearer than this explanation)
> To reply to Arjen: Yes, there are a few exceptions to this rule: there
> are cases where a short vowel turns into a long vowel in plural. But
> this is a tiny minority (to name a few: "gat" (hole), "god" (as in
> English), "weg" (road), "vat" (barrel))
I'm not sure how tiny that tiny is, I found an exception to all your
examples above :/
We'll probably be able to find an exception to any rule.
I do agree, though, that your stemming routine will probably work out
better than the current one, since the current one also gets all those
exceptions wrong, including yours.
> 3) Turn word endings "z" into "s" and "v" into "f" after stripping "en"
> or "ig"
Make sure you don't strip 'ig' if it was 'tig', like dertig, gretig,
Actually, perhaps you shouldn't strip 'ig' at all, bazig means something
different than baas. And most, if not all, -ig versions of nouns and
verbs have a (sometimes slightly, derived) different meaning.
Although I'm not too sure if there are any other clashes apart from the
slightly different noun/verb-versions.
And isn't 'ing' similar to 'ig' in this respect? I just looked it up,
and that is something you can always strip off (this time I couldn't
even find a really bad exception ;) ), it is (only?) used to create the
noun that expresses the 'performance'/'execution' of a verb. But by
doing so, the meaning does change a bit of course.
> Explanation: Dutch words never end with a "z" or "v".
Indeed, only a very small number of imported words (quiz and jazz are
probably the most common). But if you stem quiz to quis and jazz to jazs,
there are no clashes, it just looks wrong ;)
> "baas" -> "bazen" (boss, bosses)
> "bazig" (bossy)
> "leef" -> "leven" (to live)
> "neef" -> "neven" (cousin/nephew)
> "huis" -> "huizen" (house/houses)
But of course, 'bazen' is also a verb (the stem 'baas' isn't too wrong
though), 'huizen' is a verb (but again, the stem 'huis' isn't very
wrong) and 'neven' means 'next to' in Flemish (doesn't it?) and 'leven'
is a noun (his life).
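
The z/v rule quoted above is small enough to sketch in Python. The function
below is a hypothetical helper, and it assumes that "en" has already been
stripped and the long vowel already restored by an earlier step.

```python
# Illustrative sketch only; the helper name is made up for this example.
def devoice_final(stem: str) -> str:
    """Turn a final "z" into "s" and a final "v" into "f"."""
    if stem.endswith("z"):
        return stem[:-1] + "s"
    if stem.endswith("v"):
        return stem[:-1] + "f"
    return stem


for s in ("baaz", "huiz", "leev", "neev", "quiz", "jazz"):
    print(s, "->", devoice_final(s))
```

Imported words such as quiz and jazz come out as "quis" and "jazs", which
looks wrong but does not clash with anything.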
> Enough about my changes, I'm interested in your idea of the apostrophes.
And enough of my nasty exceptions for now, any stemming routine will
have flaws in Dutch, unless it manages to grasp the meaning of entire
articles and stems that context-based... And even that will probably
make mistakes, since the Dutch themselves can not always manage to
understand each and every, grammatically correct, Dutch text perfectly ;)
But I do hope the exceptions I think of are appreciated, I only try to
help reach a better solution (and to achieve that, we need examples that
stem correctly, incorrectly, or even both at the same time) :)
Arjen van der Meijden