[Snowball-discuss] Dutch Stemmers
Olly Betts
olly at survex.com
Tue Oct 8 23:19:39 BST 2019
On Mon, Oct 07, 2019 at 10:40:28AM +0100, Martin Porter wrote:
> I've sent all.txt which is easy to unpick.
The list held that message due to its size. Rather than approve it and
send everyone on the list a rather large message, I've cut it up and
created a zip archive which I've put on the website:
https://snowballstem.org/algorithms/kraaij_pohlmann/stemmer.html
> I found dstem.zip on a backup CD I pressed in 2005.
Thanks for finding that.
I had a little look at the original version.
Compiler warnings highlight one problem with it: the gt2() function is
missing `return(TRUE);` at its end, which is undefined behaviour. It
seems that, at least on current x86-64 Linux with current GCC, we
happen to get lucky and this doesn't affect the output - presumably
the generated code happens to always have a non-zero value in the return
register when it falls off the end of the function.
I can't exactly reproduce the list of differences in the table at the
URL above though. The C implementation looks like it expects ISO-8859-1
and if I convert voc.txt to that and the output back to UTF-8 then there
are 220 differences against output.txt compared to the 32 listed.
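For anyone wanting to repeat the comparison, here is a minimal Python sketch of the re-encoding and counting steps. The helper names are my own for illustration, and it deliberately leaves out running the C stemmer itself, since how that is invoked depends on the build. It assumes the stemmer output and output.txt have one entry per line, in the same order.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1.

    Raises UnicodeEncodeError if a character has no ISO-8859-1 form.
    """
    return data.decode("utf-8").encode("iso-8859-1")


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes (e.g. stemmer output) as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


def count_differences(expected: bytes, actual: bytes) -> int:
    """Count the lines that differ between two UTF-8 word lists."""
    exp = expected.decode("utf-8").splitlines()
    act = actual.decode("utf-8").splitlines()
    if len(exp) != len(act):
        raise ValueError("word lists have different lengths")
    return sum(1 for e, a in zip(exp, act) if e != a)
```

The intended use is: read voc.txt as bytes, pass it through utf8_to_latin1(), feed that to the C stemmer, pass the result through latin1_to_utf8(), and hand it to count_differences() together with the bytes of output.txt.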
I wondered if you'd accidentally fed it UTF-8 input when you originally
made that table, as I noticed only one of the 32 differences contains
non-ASCII letters. However, doing that gets me 36 differences - the 32
listed plus these four extras:
Source      KP wrong-enc   Snowball
------      ------------   --------
emiliètje   emilièt        emiliè
liètje      lièt           liè
mariètje    marièt         mariè
ruïneren    ruïneer        ruïner
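To illustrate what "wrong-enc" means at the byte level, here is a small Python sketch (the function name is mine, purely for illustration). Code that expects ISO-8859-1 sees each UTF-8 encoded accented letter as two separate bytes, neither of which is the letter it was meant to be. This only shows the misreading itself; it doesn't attempt to explain the particular stems in the table.

```python
def misread_utf8_as_latin1(word: str) -> str:
    """Show how a UTF-8 encoded word looks to code assuming ISO-8859-1."""
    return word.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    for word in ["liètje", "ruïneren"]:
        seen = misread_utf8_as_latin1(word)
        # Each accented letter becomes two ISO-8859-1 characters,
        # so the word appears one character longer than it really is.
        print(word, len(word), "->", seen, len(seen))
```

So "liètje" is 6 letters, but the stemmer is handed 7 bytes, with "è" arriving as the byte pair 0xC3 0xA8.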
The other thing I noticed is that the C implementation includes letters
with diacritics in its definition of vowels, but the Snowball version
only considers ASCII letters. I tried a quick modification to expand
the groupings in the Snowball version to match those in the C version,
which reduces the differences from 220 to 153.
The webpage says "demonstration vocabulary [...] of over 45,000" and
our dutch/voc.txt has 45669 entries. I checked the history and it
doesn't look like this file has been augmented at any point (sometimes
we've added new entries to other lists to serve as a regression test for
a bug fix or change).
Perhaps you compared with an older (or newer) version of the C stemmer?
I guess it's hard to really know this much later.
Cheers,
Olly