[Snowball-discuss] Dutch stemmer: undouble "nn", "mm", "ff"?
Edwin de Jonge
ejne@rnd.vb.cbs.nl
Thu Jan 1 13:35:02 2004
Hi,

First I want to thank Martin Porter (and everyone else working on
Snowball) for their work on Snowball.

As the search engine for a research project we are using a .NET port of
Lucene (Lucene.NET, not to be confused with NLucene).
Because this port doesn't have a Dutch stemmer, I've implemented the
Dutch Snowball stemming algorithm in C#
(my implementation will be available in a future version of Lucene.NET).
It stems the Dutch Snowball vocabulary exactly as Snowball does.

I think I have found a small improvement to the Dutch stemming algorithm
(beware, I'm not a linguist).
The routine

    define undouble as (
        test among('kk' 'dd' 'tt') [next] delete
    )

would be improved if the "nn", "mm" and "ff" endings were also undoubled:

    define undouble as (
        test among('kk' 'dd' 'tt' 'nn' 'mm' 'ff') [next] delete
    )

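For readers who don't know Snowball, the two routines above can be sketched in Python. This is only an illustration, not the C# implementation mentioned in this post: the names `ORIGINAL_PAIRS`, `PROPOSED_PAIRS` and `undouble` are made up for the example, and it assumes an earlier step has already stripped the "en" suffix (so "mannen" arrives here as "mann").

```python
# Doubled endings handled by the existing routine, and the proposed extension.
# Both names are hypothetical and exist only for this sketch.
ORIGINAL_PAIRS = ("kk", "dd", "tt")
PROPOSED_PAIRS = ORIGINAL_PAIRS + ("nn", "mm", "ff")


def undouble(word, pairs=ORIGINAL_PAIRS):
    """Drop the final letter if the word ends in one of the doubled pairs.

    Mirrors `test among(...) [next] delete`: check the ending without
    consuming it, then delete a single character.
    """
    if word.endswith(pairs):  # str.endswith accepts a tuple of suffixes
        return word[:-1]
    return word


if __name__ == "__main__":
    # Assumed intermediate forms, i.e. after "en" has been removed.
    for form in ("mann", "stoff", "zwemm", "zett"):
        print(form, "->", undouble(form), "|", undouble(form, PROPOSED_PAIRS))
```

With the original list only "zett" is shortened; with the proposed list "mann", "stoff" and "zwemm" are shortened as well.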
After this change, the stemmed Dutch Snowball vocabulary has 494
differences from the old stemmed vocabulary (in my implementation, that is).
Almost all of these differences are improvements.

Plurals are now stemmed the same as their singulars:
    "mannen" -> "man" (= men, man)
    "stoffen" -> "stof" (= substance)
    "vlammen" -> "vlam" (= flame)

Infinitives are now stemmed to the verb stem:
    "kennen" -> "ken" (= know)
    "treffen" -> "tref" (= hit)
    "zwemmen" -> "zwem" (= swim)

The only strange difference (of the 494) I've found is "binnen"
(= inside), which was stemmed to "binnen" and is now stemmed to "bin".
This is not a problem, since the new stem is not taken by another word.

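The kind of check described here (which words change stem, and whether a new stem is already taken by another word) could be scripted roughly as follows. This is a sketch under loud assumptions: `toy_stem` is a toy stand-in that only strips a final "en" and then undoubles, it is NOT the full Dutch Snowball algorithm, and the six-word `VOCABULARY` is a hypothetical sample built from words in this post, not the real Snowball vocabulary.

```python
OLD_PAIRS = ("kk", "dd", "tt")
NEW_PAIRS = OLD_PAIRS + ("nn", "mm", "ff")

# Hypothetical mini vocabulary, for illustration only.
VOCABULARY = ["man", "mannen", "stof", "stoffen", "binnen", "kennen"]


def toy_stem(word, pairs):
    """Toy stand-in for a stemmer: strip a final 'en', then undouble."""
    if word.endswith("en"):
        word = word[:-2]
    if word.endswith(pairs):
        word = word[:-1]
    return word


def changed_stems(vocabulary, old_pairs, new_pairs):
    """Map each word whose stem differs to its (old stem, new stem) pair."""
    changed = {}
    for word in vocabulary:
        old, new = toy_stem(word, old_pairs), toy_stem(word, new_pairs)
        if old != new:
            changed[word] = (old, new)
    return changed


def words_sharing_stem(vocabulary, word, pairs):
    """List the other vocabulary words that get the same stem as `word`."""
    target = toy_stem(word, pairs)
    return sorted(w for w in vocabulary
                  if w != word and toy_stem(w, pairs) == target)


if __name__ == "__main__":
    print(changed_stems(VOCABULARY, OLD_PAIRS, NEW_PAIRS))
    print(words_sharing_stem(VOCABULARY, "mannen", NEW_PAIRS))  # ['man']
    print(words_sharing_stem(VOCABULARY, "binnen", NEW_PAIRS))  # []
```

An empty list for "binnen" means no other word in the sample shares its new stem, while "mannen" now lands on the same stem as "man", which is the intended merge. A shared stem still needs a human to judge whether the merge is wanted.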
Regards,

Edwin de Jonge