[Snowball-discuss] Dutch Stemmers (was: Re: Snowball 2.0.0 released)

Olly Betts olly at survex.com
Sat Oct 5 07:20:50 BST 2019


On Fri, Oct 04, 2019 at 04:04:25PM +0100, Martin Porter wrote:
> Thank you so much for the work you have done on this. Retirement has
> kept me busy, but I will try and go through the changes list and send
> you any comments I might have,

Good to hear from you Martin.

There's an open issue about the Dutch stemmer which I didn't try to
address in this release.  It would be good to resolve, and it would be
useful to have your insights on it, since you came up with the "dutch"
algorithm (unless my memory is faulty).

This comment on the issue is probably the most useful one to read:

https://github.com/snowballstem/snowball/issues/1#issuecomment-69638501

(The initial report seems to expect the stems to all be valid Dutch
words, which indeed they aren't always, but that doesn't matter for the
intended domain of use.)

It's suggested there that we should just replace the "dutch" stemmer
with the "kraaij_pohlmann" one.  We could do that, but it would mean a
flag day for users.  We have made changes to some of the stemmers over
the years, but they've always been small adjustments, not a wholesale
change to a different algorithm.

The website has a page on the Dutch algorithm, but that doesn't say
anything about the design decisions made or the reasons behind them.
I'm also not sure if your algorithm predates Kraaij-Pohlmann's, or
if you were aware of theirs but were trying to come up with a better
approach.

All I've found is a couple of paragraphs on the "Germanic" page:

https://snowballstem.org/algorithms/germanic.html

| By contrast, Dutch is inflexionally simple, but even so, this does not
| make for any great difference between the stemmers. A feature of Dutch
| that makes it markedly different from German is that the grammar of the
| written language has changed, and continues to change, relatively
| rapidly, and that it has assimilated a large and mixed foreign
| vocabulary with some of the accompanying foreign suffixes. Foreign
| words may, or may not, be transliterated into a Dutch style. Naturally
| these create problems in stemming. The stemmer here is intended for
| native words of contemporary Dutch.
|
| In a Dutch noun, a vowel may double in the singular form (manen =
| moons, maan = moon). We attempt to solve this by undoubling the double
| vowel (Kraaij Pohlman by contrast attempt to double the single vowel).
| The endings je, tje, pje etc., although extremely common, are not
| stemmed. They are diminutives and can significantly alter word meaning.
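
To make the undoubling vs. doubling distinction in that second paragraph
concrete, here is a toy Python sketch.  It is NOT the actual "dutch" or
"kraaij_pohlmann" code: the function names, the bare "en" suffix rule and
the vowel sets are all simplifying assumptions of mine, chosen only so
that the manen/maan example from the quote works in both directions.

```python
# Toy illustration only; the real Snowball stemmers have many more rules.
VOWELS = set("aeiouy")
DOUBLE_VOWELS = ("aa", "ee", "oo", "uu")  # assumed set of doubled vowels


def strip_en(word):
    """Drop a trailing 'en' (hypothetical, much simplified suffix rule)."""
    return word[:-2] if word.endswith("en") and len(word) > 3 else word


def undouble_vowel(word):
    """Collapse a doubled vowel before a final consonant: maan -> man."""
    if len(word) >= 3 and word[-1] not in VOWELS and word[-3:-1] in DOUBLE_VOWELS:
        return word[:-3] + word[-2:]
    return word


def double_vowel(word):
    """Double a single vowel in a final consonant-vowel-consonant: man -> maan."""
    if (len(word) >= 3 and word[-1] not in VOWELS
            and word[-2] in "aeou" and word[-3] not in VOWELS):
        return word[:-1] + word[-2] + word[-1]
    return word


def stem_undoubling(word):
    """Strategy described for the 'dutch' stemmer: strip, then undouble."""
    return undouble_vowel(strip_en(word))


def stem_doubling(word):
    """Strategy attributed to Kraaij-Pohlmann: strip, then double."""
    return double_vowel(strip_en(word))


for w in ("manen", "maan"):
    print(w, stem_undoubling(w), stem_doubling(w))
```

Either way the two forms end up with the same stem; the difference is
only whether that shared stem looks like "man" or "maan".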

Cheers,
    Olly


