[Snowball-discuss] Snowball French stemming

Fred Fung fred.fung@versaterm.com
Thu Dec 11 15:53:02 2003


Good Day,

I am using OpenFTS 0.35 with the Snowball French stemming algorithm and
the Snowball wrapper downloaded from the snowball.tartarus.org site.
With this algorithm, I came across the following inconsistency:

The word "française" stemmed to become "français". This is fine.
However, when I stemmed the word "français", it became "franc".

I looked at the example list of French vocabulary and its stemmed
equivalent posted on the Snowball site under the French link, and
"français" is indeed stemmed to become "franc".

But here is the problem: I am using this stemming algorithm in
conjunction with the text search package OpenFTS 0.35. When I use the
French stemming algorithm to convert a piece of text containing the word
"française" into its indexing equivalent, and later search the table
for the word "français", I would expect this text to be considered a
match as well. But this is not the case (I have tried it), since
"français" is stemmed (using the same stemming algorithm) to "franc"
before the search starts, and so will never match the stem "français"
stored in the tsvector field.
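
To make the mismatch concrete, here is a minimal sketch. It does not call
the real Snowball stemmer or any OpenFTS API: `stem` is a toy lookup
table holding only the two results reported above, and
`stem_to_fixed_point` is a hypothetical helper. The second half shows
one idea that might sidestep the problem (stemming repeatedly until the
result stops changing, on both the indexing and the query side); whether
that is acceptable with the real algorithm is an assumption, not
something confirmed here.

```python
# Toy stand-in for the French stemmer: it knows only the two stems
# reported in this message. Every other word is returned unchanged,
# which is an assumption made purely for this sketch.
REPORTED_STEMS = {
    "française": "français",
    "français": "franc",
}


def stem(word):
    """Return the reported stem for a word, or the word itself."""
    return REPORTED_STEMS.get(word, word)


def stem_to_fixed_point(word, max_rounds=10):
    """Hypothetical helper: apply stem() until the result stops changing."""
    for _ in range(max_rounds):
        stemmed = stem(word)
        if stemmed == word:
            break
        word = stemmed
    return word


# Single-pass stemming, as described above.
indexed = stem("française")   # stem stored at indexing time
query = stem("français")      # stem computed at search time
print(indexed, query, indexed == query)

# Repeated stemming on both sides.
indexed_fp = stem_to_fixed_point("française")
query_fp = stem_to_fixed_point("français")
print(indexed_fp, query_fp, indexed_fp == query_fp)
```

With single-pass stemming the indexed stem and the query stem differ; in
this toy table both words reach the same stem once the stemmer is applied
repeatedly.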

Is this something one has to live with when using this French stemming
algorithm? If not, is there any way to work around the problem I
mentioned here?

Thanks.


Fred 