[Snowball-discuss] 99% of English words ending -sis and -xis are not plurals

the Tolkin family tolkin@mediaone.net
Tue, 4 Dec 2001 22:53:19 -0500


A while ago I said I might suggest more fundamental changes to
the approach used in the Porter2 stemmer.
Here is another one, probably my last.
(You can probably also improve the handling of f -> v, e.g. life, self, etc.)

There are over 800 words that end with -sis and of these only 11,
about 1%, are plurals.
Almost all the rest are singular words, whose plural ends with -ses.
The words that are plurals are generally quite uncommon.  Here they are:
brindisis chaprassis dalasis kolbasis kolbassis lassis
pachisis parchesis parchisis reversis sannyasis tsotsis
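
One way to shortlist candidate plurals in a word list is sketched below.
It is only a heuristic and an assumption on my part, not necessarily how
the list above was produced: a word ending in -sis is flagged when
dropping its final s gives another word in the list.  The function name
candidate_plurals is hypothetical, and the output still needs review by
hand.

```python
def candidate_plurals(words, suffix="sis"):
    """Return words ending in `suffix` whose form without the final
    's' is also in the list (e.g. lassis -> lassi).

    Heuristic only: it shortlists possible plurals for manual review.
    """
    vocab = set(words)
    return sorted(
        w for w in vocab
        if w.endswith(suffix) and w[:-1] in vocab
    )


if __name__ == "__main__":
    sample = ["lassi", "lassis", "analysis", "analyses", "basis"]
    print(candidate_plurals(sample))
```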

So instead of the current rule, which simply removes the final -s, I propose
the following rule, which changes -sis to -ses, with a few exceptions.
(We generally want to conflate singular and plural.  But there are too
many -ses words to go in the usual direction from plural to singular.
So this rule goes in the other direction.)
This must be run before the current rule 1, so I'll call it rule 0.5a.
I express this in pseudocode.

if word is theses then stem is thesis && stop
if word ends with sis {
  if word is sis then stem is sis && stop
  if word is psis then stem is psi && stop
  if word is thesis then stem is thesis && stop
  change final sis to ses
}
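
For anyone who prefers something runnable, here is a minimal Python
sketch of rule 0.5a.  The function name rule_0_5a and the (word, done)
return convention are my own assumptions for illustration; they are not
part of Snowball.  Note that "theses" does not end in -sis, so the sketch
tests it before the suffix check.

```python
# Exceptions that are returned as final stems (the "&& stop" cases).
SIS_EXCEPTIONS = {
    "sis": "sis",
    "psis": "psi",
    "thesis": "thesis",
}


def rule_0_5a(word):
    """Return (result, done).

    done=True means the result is the final stem and no further
    rules should run; done=False means later rules still apply.
    """
    # "theses" does not end in -sis, so handle it up front.
    if word == "theses":
        return "thesis", True
    if word.endswith("sis"):
        if word in SIS_EXCEPTIONS:
            return SIS_EXCEPTIONS[word], True
        # Rewrite the singular as its plural so that later rules
        # treat e.g. "analysis" and "analyses" identically.
        return word[:-3] + "ses", False
    return word, False


if __name__ == "__main__":
    for w in ["analysis", "thesis", "theses", "psis", "cat"]:
        print(w, rule_0_5a(w))
```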

I put special handling for thesis and theses because otherwise these
would become "these".  Certainly thesis is a likely search term.
(Another possible stem for thesis and theses might be "thes".)

(The rule above could be written so that -sis must occur in the R1
or R2 region.  That would remove the special cases for sis and psis,
but would require adding several others.)
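
To see which words an R1 restriction would exclude, here is a rough
sketch.  It uses the basic definition of R1 (the region after the first
non-vowel that follows a vowel) with a, e, i, o, u, y as vowels, and it
deliberately ignores any finer points of the English stemmer such as y
acting as a consonant.  The names r1_start and sis_in_r1 are hypothetical.

```python
VOWELS = set("aeiouy")


def r1_start(word):
    """Index where R1 begins: just past the first non-vowel that
    follows a vowel, or len(word) if there is no such non-vowel."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def sis_in_r1(word):
    """True if the word ends in -sis and that suffix lies wholly in R1."""
    return word.endswith("sis") and len(word) - 3 >= r1_start(word)


if __name__ == "__main__":
    for w in ["sis", "psis", "thesis", "basis", "crisis", "analysis"]:
        print(w, sis_in_r1(w))
```

Under this simplified definition sis and psis drop out as intended, but
so do short words such as thesis, basis and crisis, which is why other
special cases would then be needed.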

The 11 true plurals above are no longer handled correctly, but those words
are rare and many other plurals are not handled correctly today, so I do not
bother to fix them.  Perhaps one could special-case lassis -> lassi to avoid
a clash with lass.

Another possible special case is "basis".  The rule above conflates it with
bases, which is its plural, but that causes it to also conflate with base.
One might want to add another special case:
if word is basis then stem is basis && stop
This rule causes a few conflations that might not be as desirable as possible,
e.g. ellipsis and ellipses, synapsis and synapses, phasis and phases,
and whosis and whose.
These could also be worth adding to the list of special cases.
But I have tried to have as few as possible.

An analogous rule applies to -xis.  Again, almost all of the roughly 60 words
ending with -xis are not plural.
Rule 0.5b below captures this, and the few exceptions.

if word ends with xis {
  if word is xis then stem is xi && stop
  if word is maxis then stem is maxi && stop
  if word is taxis then stem is taxi && stop
  change final xis to xes
}
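
The same kind of Python sketch works for rule 0.5b.  As before, the
function name rule_0_5b and the (word, done) return convention are
illustrative assumptions, not anything defined by Snowball.

```python
# Words ending in -xis that really are plurals, with their stems.
XIS_EXCEPTIONS = {
    "xis": "xi",
    "maxis": "maxi",
    "taxis": "taxi",
}


def rule_0_5b(word):
    """Return (result, done); done=True means stop stemming here."""
    if word.endswith("xis"):
        if word in XIS_EXCEPTIONS:
            return XIS_EXCEPTIONS[word], True
        # Rewrite the singular as its plural, e.g. axis -> axes,
        # and let the later rules handle the plural form.
        return word[:-3] + "xes", False
    return word, False


if __name__ == "__main__":
    for w in ["axis", "praxis", "taxis", "maxis", "box"]:
        print(w, rule_0_5b(w))
```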

Here axis gets conflated with axes (its plural) but also with axe.  That seems
acceptable.  (There is a singular word taxis, with plural taxes, but both those
strings are far more common in their usual meaning.  We do not want to
conflate taxis with tax.)

Misc.
I have written these as 2 separate rules, but a performance tweak might test
whether the word ends with -is first.
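
A sketch of that combined form, under the same assumptions as the
pseudocode (the name rule_0_5 and the (word, done) return convention are
hypothetical).  Most words fail the cheap -is test and return at once.
"theses" does not end in -is, so it is tested before the guard.

```python
# All exact-match exceptions from rules 0.5a and 0.5b in one table.
EXCEPTIONS = {
    "sis": "sis",
    "psis": "psi",
    "thesis": "thesis",
    "xis": "xi",
    "maxis": "maxi",
    "taxis": "taxi",
}


def rule_0_5(word):
    """Return (result, done); done=True means stop stemming here."""
    if word == "theses":
        return "thesis", True
    # Cheap guard: most words do not end in -is at all.
    if not word.endswith("is"):
        return word, False
    if word in EXCEPTIONS:
        return EXCEPTIONS[word], True
    # Only -sis and -xis are rewritten; other -is words pass through.
    if len(word) >= 3 and word[-3] in "sx":
        return word[:-2] + "es", False
    return word, False


if __name__ == "__main__":
    for w in ["analysis", "axis", "taxis", "this", "is", "cat"]:
        print(w, rule_0_5(w))
```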

On a completely separate topic, "lens" is another word
that should be special-cased to return "lens" as its stem, so that
it conflates with lenses (and so that it does not conflate with "len", the
common computer science abbreviation for length).

References:
This analysis is based on the very large list of words known as YAWL (Yet Another
Word List) available from e.g. http://personal.riverusers.com/~thegrendel/software.html
and elsewhere.

Hopefully helpfully yours,
Steve
-- 
Steven Tolkin          steve.tolkin@fmr.com      617-563-0516
Fidelity Investments   82 Devonshire St. V1D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.





_______________________________________________
Snowball-discuss mailing list
Snowball-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/snowball-discuss