[Snowball-discuss] Dutch Stemmers (was: Re: Snowball 2.0.0 released)
Martin Porter
martin.f.porter at gmail.com
Sat Oct 5 14:57:55 BST 2019
I'll abbreviate "Kraaij" and "Pohlmann" to "K" and "P".
I wrote my own version of the Dutch stemmer following the one for
German, and used the simple German one as a model. The KP stemmer was
described in K and P's paper of 1994, and I'd been aware of it before
snowball began. Coding it up in snowball was intended as a demo to
illustrate snowball's usefulness, and the Lovins stemmer, the Schinke
Latin stemmer and indeed the 1980 Porter stemmer were coded up in the
same spirit. Whether the KP work was done before or after writing my own
Dutch stemmer I do not recall (though I suppose it could be worked
out), but it doesn't really matter.
The 1994 paper gives useful notes on the KP stemmer, but there is no
"algorithmic description" of the whole process. I had to rely on the
source code, written in C. Unfortunately I did not keep the source
code, and the link
"The UPLIFT project page, from where the original stemmer can be downloaded"
from my old site and from https://snowballstem.org/ is now broken, and
I cannot trace the source code on the internet. K and P modestly
described their stemmer as the "Dutch Porter" stemmer, so any search
with Dutch, K and P just brings you back to the snowball work. But for
completeness we should try to trace their original coding. At one
point at least, their coding was a bit of a tangle, and it was not
clear to me how it would be described algorithmically. For this reason
alone I did not want to adopt the KP stemmer as the Dutch stemmer by
default. I also had doubts about some of their decisions: for example,
is it okay to remove prefix ge- and infix -ge- as a past participle
element when there are so many words containing ge- or -ge- which are
not past participles?
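The ge- concern can be illustrated with a minimal sketch. It assumes a deliberately naive rule that drops any leading "ge" from words of five or more letters; the function name and the length threshold are hypothetical, and this is not the KP code, whose actual conditions may well be more careful.

```python
def strip_ge_prefix(word: str) -> str:
    """Naively drop a leading 'ge' (hypothetical rule, NOT the KP stemmer)."""
    # Assumed threshold: leave very short words such as 'geld' alone.
    if word.startswith("ge") and len(word) >= 5:
        return word[2:]
    return word


# Past participles, where removing ge- is the intended effect.
participles = ["gewerkt", "gemaakt"]
# Ordinary nouns that merely begin with ge-.
others = ["gebied", "gevaar", "gezin"]

for w in participles + others:
    print(w, "->", strip_ge_prefix(w))
```

Under this rule "gebied" (area), "gevaar" (danger) and "gezin" (family) are cut down to "bied", "vaar" and "zin", which are forms of unrelated words, so they would be conflated with the wrong stems.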
But since the KP stemmer is reported to do better than mine, more
credit to K and P, and on this evidence it is the one to be preferred.
I think it would be a mistake to put "kp" in place of "dutch" as a
name. The Porter stemmer as distributed has always differed from the
original algorithm in 3 tiny respects, but when I coded it up in
snowball I foolishly followed the 1980 paper exactly, so the version
in snowball differs just slightly from the official one at
tartarus.org/martin/PorterStemmer. This has caused quite a bit of
confusion! I would change both names to dutch_simple and dutch_kp (or
something), in anticipation of a future release.