[Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?

Edwin de Jonge ejne@rnd.vb.cbs.nl
Thu Jan 1 13:35:02 2004


Hi,

First I want to thank Martin Porter (and everyone else working on
Snowball) for their work on Snowball.

As the search engine for a research project we are using a .NET port of
Lucene (Lucene.NET, not to be confused with nlucene).
Because this port doesn't have a Dutch stemmer, I've implemented the
Dutch Snowball stemming algorithm in C# (my implementation will be
available in a future version of Lucene.NET).
It stems the Dutch Snowball vocabulary exactly as Snowball does.

I think I have found a small improvement to the Dutch stemming algorithm
(beware, I'm not a linguist).
The routine

    define undouble as (
        test among('kk' 'dd' 'tt') [next] delete
    )

would be improved if the "nn", "mm" and "ff" endings were also undoubled:

    define undouble as (
        test among('kk' 'dd' 'tt' 'nn' 'mm' 'ff') [next] delete
    )
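
To make the effect of the two routines concrete, here is a minimal Python
sketch of the undouble step on its own. It is not the Snowball source or
the C# port; the function name and the sample inputs are assumptions, and
the inputs stand for intermediate forms after an earlier step has already
removed a suffix such as "en" (e.g. "mannen" reduced to "mann").

```python
# Doubled endings handled by the current routine and by the proposed one.
OLD_DOUBLES = ("kk", "dd", "tt")
NEW_DOUBLES = OLD_DOUBLES + ("nn", "mm", "ff")


def undouble(word, doubles):
    """Drop the final letter if the word ends in one of the doubled consonants."""
    # str.endswith accepts a tuple of candidate suffixes.
    if word.endswith(doubles):
        return word[:-1]
    return word


if __name__ == "__main__":
    # Hypothetical intermediate forms, not output of a full stemmer run.
    for w in ("mann", "stoff", "zwemm", "bedd"):
        print(w, undouble(w, OLD_DOUBLES), undouble(w, NEW_DOUBLES))
```

Only the last letter is removed, which mirrors the `[next] delete` part of
the routine; the `test among(...)` part corresponds to the `endswith` check.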

After this change, the stemmed Dutch Snowball vocabulary differs from the
old stemmed vocabulary in 494 places (in my implementation, at least).
Almost all of these differences are improvements:
    Plurals are now stemmed the same as their singulars:
        "mannen" -> "man" (= men, man)
        "stoffen" -> "stof" (= substance)
        "vlammen" -> "vlam" (= flame)
    Infinitives are now stemmed to the verb stem:
        "kennen" -> "ken" (= know)
        "treffen" -> "tref" (= hit)
        "zwemmen" -> "zwem" (= swim)

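A comparison like the one above can be sketched in Python as follows. This
is only an illustration of the idea, not the tool that produced the count
of 494. The function name is an assumption, and the sample data is
illustrative: only the "binnen" pair is taken from this message, the other
entries are placeholders.

```python
def changed_stems(old, new):
    """Return (word, old_stem, new_stem) for every word whose stem differs."""
    return [
        (word, old[word], new[word])
        for word in sorted(old)
        if word in new and old[word] != new[word]
    ]


# Illustrative data only: the "binnen" pair comes from the message,
# the remaining old stems are assumed placeholders.
old = {"binnen": "binnen", "mannen": "mann", "huis": "huis"}
new = {"binnen": "bin", "mannen": "man", "huis": "huis"}

diffs = changed_stems(old, new)

if __name__ == "__main__":
    print(len(diffs), "differences")
    for word, before, after in diffs:
        print(word, before, "->", after)
```
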
The only strange difference (of the 494) I've found is "binnen"
(= inside), which was stemmed to "binnen" and is now stemmed to "bin".
This is not a problem, since the new stem is not taken by another word.

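One way to check whether a new stem is "taken" by another word is to group
the vocabulary by stem and list the stems shared by more than one word. The
Python sketch below assumes a hypothetical mapping from words to stems; the
entry for "man" is an assumption, the other two pairs are from this message.

```python
from collections import defaultdict


def words_by_stem(stems):
    """Group words by the stem they map to."""
    groups = defaultdict(list)
    for word, stem in stems.items():
        groups[stem].append(word)
    return {stem: sorted(words) for stem, words in groups.items()}


# Hypothetical sample: "man" -> "man" is assumed for illustration.
stems = {"binnen": "bin", "mannen": "man", "man": "man"}
groups = words_by_stem(stems)

if __name__ == "__main__":
    for stem, words in sorted(groups.items()):
        status = "shared" if len(words) > 1 else "unique"
        print(stem, words, status)
```

In this sample "bin" is produced by "binnen" alone, while "man" is shared by
"man" and "mannen", which is the intended conflation.
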
Regards,

Edwin de Jonge
