[Snowball-discuss] Dutch Stemmers

Sun Oct 6 22:52:44 BST 2019

On Sat, Oct 05, 2019 at 02:57:55PM +0100, Martin Porter wrote:
> The 1994 paper gives useful notes on the KP stemmer, but there is no
> "algorithmic description" of the whole process. I had to rely on the
> source code, written in C. Unfortunately I did not keep the source
> code, and the link
> 
> "The UPLIFT project page, from where the original stemmer can be downloaded"
> 
> from my old site and from https://snowballstem.org/ is now broken, and
> I cannot trace the source code on the internet.

The Internet Archive has the UPLIFT project page itself:

https://web.archive.org/web/20040508063432/http://let.uu.nl/~uplift/

Unfortunately they didn't archive the linked source code download (which
was at http://let.uu.nl/~uplift/dstem.tar.gz), and trying to locate a
copy by the filename "dstem.tar.gz" only turns up a few mentions in
papers.

> K and P modestly described their stemmer as the "Dutch Porter"
> stemmer, so any search with Dutch, K and P just brings you back to the
> snowball work. But for completeness we should try to trace their
> original coding.

It would be good to have the original.  Perhaps we should try contacting
the authors - have you communicated with them before?

> At one point at least, their coding was a bit of a tangle, and it was
> not clear to me how it would be described algorithmically. For this
> reason alone I did not want to adopt the KP stemmer as the Dutch
> stemmer by default.

My suspicion is that the algorithm you downloaded evolved significantly
from that described by the papers.

The papers seem to say that the stemmer is described by 98 suffix
replacement rules, formed into six clusters.  Unhelpfully, only 2
example rules from each cluster are shown, so as you say they don't
describe the whole process.  (The total of 98 comes from
"Viewing Stemming as Recall Enhancement"; the 12 example rules are in
"Porter’s stemming algorithm for Dutch" and repeated in "Evaluation of a
Dutch stemming algorithm".)

But just comparing those 12 examples to the snowball implementation I
see many differences.

Step_1 has an 'en' suffix rule, but the condition is much more complex
than the "ends with consonant" that the paper has.  It doesn't have
the 'e' suffix with preceding consonant rule that the paper has, but
does have 'nde' (<-'nd') which might be an evolution of that.
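
To make the comparison concrete, here is a minimal Python sketch of the
two cluster-1 example rules as summarised above (drop 'en' when what
remains ends with a consonant; drop 'e' when it is preceded by a
consonant).  The function names, the vowel set and the order in which
the suffixes are tried are assumptions for illustration; this is
neither the KP code nor the Snowball Step_1.

```python
# Assumed vowel set; the papers' exact definition is not quoted here.
VOWELS = set("aeiouy")


def ends_with_consonant(s):
    """True if s is non-empty and its last letter is not a vowel."""
    return bool(s) and s[-1] not in VOWELS


def paper_style_step1(word):
    """Apply the two example rules, trying the longer suffix first."""
    for suffix in ("en", "e"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if ends_with_consonant(stem):
                return stem
    return word


for w in ("boeken", "grote", "zee"):
    print(w, "->", paper_style_step1(w))
```

With these assumptions "boeken" and "grote" lose their suffix, while
"zee" is left alone because the 'e' follows a vowel.  The Snowball
'en' condition is considerably more involved than this.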

Then Step_2 has neither the 'etj' nor the 'tj' rule the paper shows,
but it does have a 'je' rule with various checks, including for a 't'
before it, so that looks like an evolution connected to the 'e' suffix
no longer being removed in Step_1.

The 'baar' rule for Step_4 is there, but the 'ig' suffix rule seems to
have become 'achtig' and 'erig' rules.

Their fifth cluster examples are 'ge' prefix and infix removal, but
the conditions seem to have become more complex.

The Step_3 and Step_6 example rules shown do actually correspond
exactly.

So I think they probably started from an algorithm described by these
rules, implemented in C starting from Frakes' implementation of your
stemmer, and then a significant amount of tweaking went on to give the C
version you started from.

> I also had doubts about some of their decisions: for example is it
> okay to remove prefix ge- and infix -ge- as a past participle element
> when there are so many words containing ge-, -ge- which are not past
> participles?

I'll see if I can get feedback on that from those who commented on the
issue.
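
The concern about ge- is easy to see with a deliberately naive Python
sketch that strips a leading 'ge' with no further checks.  The function
name and the word list are illustrative assumptions; the actual KP
conditions are more complex than this.

```python
def naive_strip_ge(word):
    """Hypothetical rule: remove a leading 'ge' unconditionally."""
    if word.startswith("ge") and len(word) > 2:
        return word[2:]
    return word


# "gewerkt" is a past participle (of "werken"); "geld" (money) and
# "gebouw" (building) are not, but the naive rule strips them anyway.
for w in ("gewerkt", "geld", "gebouw"):
    print(w, "->", naive_strip_ge(w))
```

Any real rule needs extra conditions to avoid mangling words like the
last two, which is presumably why the conditions grew.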

> But since the KP stemmer is reported to do better than mine, more
> credit to K and P, and on this evidence is the one to be preferred.

OK.

> I think it would be a mistake to put "kp" in place of "dutch" as a
> name. The Porter stemmer as distributed has always differed from the
> original algorithm in 3 tiny respects, but when I coded it up in
> snowball I foolishly followed the 1980 paper exactly, so the version
> in snowball differs just slightly from the official one at
> tartarus.org/martin/PorterStemmer. This has caused quite a bit of
> confusion! I would change both names to dutch_simple and dutch_kp (or
> something), in anticipation of a future release.

Yes, I wouldn't literally "mv kraaij_pohlmann.sbl dutch.sbl" as that's
going to confuse things further.

Cheers,
    Olly