[Snowball-discuss] Undoubling in Dutch stemmer

Edwin de Jonge ejne at rnd.vb.cbs.nl
Thu Dec 9 14:08:43 GMT 2004


Hi Martin/Blake,

I raised the same point as Blake quite some time ago, but I did not have
the time to propose an improvement to the Snowball script for Dutch.
I'll try to do this next week. As a native speaker of Dutch, I find
undoubling only "kk", "dd" and "tt" rather arbitrary.
Double consonants in Dutch mainly serve to keep the preceding vowel
short (except at the end of a word, where double consonants are not
written).

The idea (not in Snowball syntax):
One of the problems with the current Dutch stemmer is that words with
short and long vowels are reduced to the same stem.
E.g. "makken" becomes "mak", and "maken" also becomes "mak" ("maak"
would be more natural).
If the undoubling of consonants is generalized to all consonants, the
stemmer should compensate for this effect by doubling vowels.

So the following should be done:
1) Modify the undouble procedure to do the following:
 if ending in a double consonant
	// generalisation of the undouble rule
	remove one of the consonants
 else
	if ending in CVC (consonant, among('a' 'e' 'o' 'u'), consonant)
		// make the vowel long by doubling it
		double vowel
		// transform 'v' and 'z' into 'f' and 's'
		// ('huizen' -> 'huis', 'leven' -> 'leef')
		if ending among ('v', 'z')
			'v' <- 'f'
			'z' <- 's'
2) Remove the vowel undoubling (step 4).
I think this change would be an improvement of your Dutch Stemmer.
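
For illustration only, here is a rough Python sketch of step 1 as
written above. It is not Snowball and not the actual stemmer: the
function name undouble, the vowel set, and the assumption that the
input is the word after a suffix such as "en" has already been removed
are all mine. It follows the nesting of the outline literally, so the
'v'/'z' rewrite only fires inside the CVC branch; a stem like "huiz"
is therefore left alone here, and whether that rewrite should also
apply outside the CVC branch is left open.

```python
# Hypothetical sketch of the proposed undouble step (not the Snowball source).
VOWELS = set("aeiouyè")      # assumed vowel set for the consonant test
DOUBLABLE = set("aeou")      # vowels the outline proposes to double


def is_consonant(ch: str) -> bool:
    """True for an alphabetic character that is not in the assumed vowel set."""
    return ch.isalpha() and ch not in VOWELS


def undouble(stem: str) -> str:
    """Apply the generalised undouble rule to a stem whose suffix is already removed."""
    # Ending in a double consonant: remove one of the consonants.
    if len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem[-1]):
        return stem[:-1]

    # Ending in consonant + (a|e|o|u) + consonant: double the vowel.
    if (
        len(stem) >= 3
        and is_consonant(stem[-1])
        and stem[-2] in DOUBLABLE
        and is_consonant(stem[-3])
    ):
        stem = stem[:-1] + stem[-2] + stem[-1]
        # Final 'v' becomes 'f', final 'z' becomes 's'.
        if stem.endswith("v"):
            stem = stem[:-1] + "f"
        elif stem.endswith("z"):
            stem = stem[:-1] + "s"
    return stem


if __name__ == "__main__":
    for s in ("makk", "mak", "lev"):
        print(s, "->", undouble(s))
```

With this sketch, "makk" (from "makken") gives "mak", "mak" (from
"maken") gives "maak", and "lev" (from "leven") gives "leef", so the
short and long vowel words no longer collide.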

It should not be very difficult to translate this into snowball, but I
don't speak snowball fluently (yet). I'll give it a try next week.

Regards,

Edwin

> -----Original Message-----
> From: Martin Porter [mailto:martin.porter at grapeshot.co.uk]
> Sent: donderdag 9 december 2004 9:32
> To: Blake Madden; snowball-discuss at lists.tartarus.org
> Subject: Re: [Snowball-discuss] Undoubling in Dutch stemmer
>
>
> I don't recall the details now, but I think I went through
> consonant by consonant trying the effect of undoubling. That
> was my approach in developing the stemmers generally. So if a
> plausible rule is not included it is because, on balance, it
> did not seem to lead to an improvement. Of course any rule
> might be reassessed, and will store this one for the next
> time I look at the Dutch stemmer.
>
> You may have noticed that there is not much work being done
> on the stemmers at the moment ...
>
> Martin