[Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?
Edwin de Jonge
Mon Jan 5 08:55:02 2004
I have given undouble improvement some thought and I think I have
generalized it and have some further improvements.
I haven't had the time to test it on the sample vocabulary, but I
probably will do this coming week.
Ok, what is the generalization: the undoubling can be done for all
consonants (except for "zz", "cc", "vv", "jj" because they do not
occur, as far as I know). Explanation: The undouble routine is
invoked after "en", "e" or "ene" is stripped from the word. This is done
(I guess) so the stem is the same as the stem of a singular noun or
verb. The doubles "tt" and "dd" seem special because they are used in
past tense, but they can be treated the same as the other double
consonants.
Words in Dutch never (I'm not aware of any example) end in a double
consonant (unlike German), so stripping double consonants seems like a
good idea (at least to me).
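A minimal Python sketch of the generalized undoubling described above. It is not the Snowball implementation: the function names are hypothetical, only the "en" ending is handled, and "zz", "cc", "vv" and "jj" are not excluded because they should never match anyway.

```python
# Illustrative sketch only; names are assumptions, not Snowball routines.
CONSONANTS = set("bcdfghjklmnpqrstvwxz")


def undouble(stem: str) -> str:
    """Drop the final letter when the stem ends in a doubled consonant."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in CONSONANTS:
        return stem[:-1]
    return stem


def strip_en_and_undouble(word: str) -> str:
    """Strip a trailing "en", then undouble whatever consonant remains."""
    if word.endswith("en"):
        return undouble(word[:-2])
    return word


if __name__ == "__main__":
    for plural in ("mannen", "bommen", "bedden", "katten"):
        print(plural, "->", strip_en_and_undouble(plural))
```

Doubled vowels such as the "aa" in "maan" are left alone, since only consonants are undoubled.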
If Dutch words don't end in double consonants, then what is their
function inside words?
Well, they mark a short vowel. The following vowels have a short and
long version (i.e. all vowels but "i" and "y"): "a", "e", "o", "u".
Long vowels (V) are normally written twice (VV), except when they are
followed by a consonant (c) that is followed by a vowel (v), and they
are the first character or preceded by a consonant (are you still with
me?). In that case they are written as a single character (= cVcv).
To mark a difference between a short and long vowel in plurals (that may
end with "en") consonants can be doubled.
Singular -> plural
"maan" -> "manen" (moon, moons), long vowel.
"man" -> "mannen" (man, men), short vowel
"boom" -> "bomen" (tree -> trees), long vowel
"bom" -> "bommen" (bomb -> bombs), short vowel.
The Dutch snowball stemming algorithm currently stems long and short
vowel versions the same (which isn't correct). This can be improved by
doubling a vowel (= injecting a vowel), if it is a long vowel: this is
only needed for the last long vowel in a word (because the rest will not
affect the stemming).
"manen", is now stemmed "man", but then would be stemmed as "maan"
"bomen", is now stemmed "bom", but then would be stemmed as "boom"
I will try to put this routine into snowball syntax (which will
hopefully be clearer than this explanation).
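Until then, the idea can be sketched in Python. This is a rough illustration under assumptions, not the Snowball routine: the function name is hypothetical, only the "en" ending is handled, and a vowel is treated as long when it is a single "a", "e", "o" or "u" followed by a single consonant.

```python
# Illustrative sketch only; names and exact conditions are assumptions.
VOWELS = set("aeiouy")
LENGTHENABLE = set("aeou")  # vowels that have a short and a long form


def stem_en_plural(word: str) -> str:
    """Strip "en", then undouble a consonant or double a long vowel."""
    if not word.endswith("en"):
        return word
    stem = word[:-2]
    if len(stem) < 2 or stem[-1] in VOWELS:
        return stem
    # Doubled consonant: the preceding vowel is short, so just undouble.
    if stem[-1] == stem[-2]:
        return stem[:-1]
    # Single vowel before a single consonant: long vowel, write it twice.
    single_vowel = stem[-2] in LENGTHENABLE and (
        len(stem) == 2 or stem[-3] not in VOWELS
    )
    if single_vowel:
        return stem[:-2] + 2 * stem[-2] + stem[-1]
    return stem


if __name__ == "__main__":
    for plural in ("manen", "mannen", "bomen", "bommen", "boeken"):
        print(plural, "->", stem_en_plural(plural))
```

With this sketch "manen" and "mannen" no longer share a stem, while a word such as "boeken", whose vowel is already written with two letters, is left as "boek".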
To reply to Arjen: Yes, there are a few exceptions to this rule: there
are cases where a short vowel turns into a long vowel in plural. But
this is a tiny minority (to name a few: "gat" (hole), "god" (as in
English), "weg" (road), "vat" (barrel)).
3) Turn word endings "z" into "s" and "v" into "f" after stripping "en"
Explanation: Dutch words never end with a "z" or "v".
"baas" -> "bazen" (boss, bosses)
"leef" -> "leven" (to live)
"neef" -> "neven" (cousin/nephew)
"huis" -> "huizen" (house/houses)
Enough about my changes, I'm interested in your idea of the apostrophes.
> -----Original message-----
> From: Martin Porter [mailto:email@example.com]
> Sent: Thursday 1 January 2004 19:56
> To: Edwin de Jonge; firstname.lastname@example.org
> Subject: Re: [Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?
> Thanks for that idea, which I'll try out. There are a number
> of outstanding suggestions to work through, and I must set
> some time aside to look at them early this year.
> A new idea of mine: I think apostrophe ought to form part of
> the alphabet of Dutch, and indeed of English. I haven't
> really had time to put that in though.