[Snowball-discuss] Dutch Stemmers

Tue Oct 8 23:19:39 BST 2019

On Mon, Oct 07, 2019 at 10:40:28AM +0100, Martin Porter wrote:
> I've sent all.txt which is easy to unpick.

The list held that message due to its size.  Rather than approve it and
send everyone on the list a rather large message, I've cut it up and
created a zip archive which I've put on the website:

https://snowballstem.org/algorithms/kraaij_pohlmann/stemmer.html

> I found dstem.zip on a backup CD I pressed in 2005.

Thanks for finding that.

I had a little look at the original version.

Compiler warnings highlight one problem with it: the gt2() function is
missing `return(TRUE);` at its end, which is undefined behaviour.  It
seems that, at least on current x86-64 Linux with current GCC, we
happen to get lucky and this doesn't affect the output - presumably
the generated code always happens to have a non-zero value in the return
register when it falls off the end of the function.

I can't exactly reproduce the list of differences in the table at the
URL above though.  The C implementation looks like it expects ISO-8859-1
and if I convert voc.txt to that and the output back to UTF-8 then there
are 220 differences against output.txt compared to the 32 listed.
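
A rough sketch of that convert-and-compare step, in Python.  The helper
names are made up for illustration and the stemmer run itself is left
out, so this only shows the re-encoding and the line-by-line count, not
the exact commands used:

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1.

    Raises UnicodeEncodeError if a character has no ISO-8859-1 form.
    """
    return data.decode("utf-8").encode("iso-8859-1")


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes (e.g. stemmer output) as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def differing_lines(expected: bytes, actual: bytes):
    """Return (line_number, expected, actual) for each differing line.

    Both inputs are UTF-8.  zip() stops at the shorter input, so this
    assumes the two outputs have the same number of lines.
    """
    exp = expected.decode("utf-8").splitlines()
    act = actual.decode("utf-8").splitlines()
    return [
        (n, e, a)
        for n, (e, a) in enumerate(zip(exp, act), start=1)
        if e != a
    ]


if __name__ == "__main__":
    # Tiny made-up inputs, just to show the shape of the result.
    reference = "aap\nboom\n".encode("utf-8")
    produced = "aap\nbom\n".encode("utf-8")
    print(len(differing_lines(reference, produced)), "difference(s)")
```

Counting `differing_lines()` between output.txt and the converted
stemmer output is the kind of comparison the numbers above refer to.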

I wondered if you'd accidentally fed it UTF-8 input when you originally
made that table, as I noticed only one of the 32 differences contains
non-ASCII letters.  However doing that gets me 36 differences - the 32
listed plus these four extras:

Source      KP wrong-enc    Snowball
------      ------------    --------
emiliètje   emilièt         emiliè
liètje      lièt            liè
mariètje    marièt          mariè
ruïneren    ruïneer         ruïner
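
To illustrate why the wrong encoding can matter: code that expects
ISO-8859-1 treats each byte as one character, so every accented letter
in UTF-8 input shows up as two unrelated characters.  A minimal Python
sketch (the function name is just for illustration, and this says
nothing about how the C code then classifies those bytes):

```python
def as_latin1_reader_sees_it(word: str) -> str:
    """Show UTF-8 encoded text the way a byte-per-character
    ISO-8859-1 reader would interpret it."""
    return word.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    for word in ("liètje", "ruïneren"):
        seen = as_latin1_reader_sees_it(word)
        print(f"{word!r}: {len(word)} letters, read as {seen!r} "
              f"({len(seen)} single-byte characters)")
```

So a suffix rule that looks at the character before "tje" would be
looking at the second byte of the UTF-8 sequence rather than at "è".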

The other thing I noticed is that the C implementation includes letters
with diacritics in its definition of vowels, but the Snowball version
only considers ASCII letters.  I tried a quick modification to expand
the groupings in the Snowball version to match those in the C version,
which reduces the differences from 220 to 153.
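
The effect of a wider vowel definition can be sketched like this in
Python.  Both sets are assumptions for illustration only: they are not
copied from the C source or from the Snowball script, and the accented
letters listed are just a plausible selection:

```python
# Hypothetical vowel sets, NOT taken from either implementation.
ASCII_VOWELS = frozenset("aeiouy")
EXTENDED_VOWELS = ASCII_VOWELS | frozenset("àáâäèéêëìíîïòóôöùúûü")


def ends_in_vowel(stem: str, vowels: frozenset) -> bool:
    """True if the last letter of a non-empty stem is in the given set."""
    return bool(stem) and stem[-1] in vowels


if __name__ == "__main__":
    for stem in ("emiliè", "boom"):
        print(stem,
              ends_in_vowel(stem, ASCII_VOWELS),
              ends_in_vowel(stem, EXTENDED_VOWELS))
```

Any rule conditioned on "preceded by a vowel" can then fire for one
definition and not the other, which is one way the two versions could
diverge on words containing diacritics.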

The webpage says "demonstration vocabulary [...] of over 45,000" and
our dutch/voc.txt has 45669 entries.  I checked the history and it
doesn't look like this file has been augmented at any point (sometimes
we've added new entries to other lists to serve as a regression test for
a bug fix or change).

Perhaps you compared with an older (or newer) version of the C stemmer?

I guess it's hard to really know this much later.

Cheers,
    Olly