[Snowball-discuss] add suffix acy to Porter stemmer for English
Steve Tolkin
stevetolkin at comcast.net
Sat Jan 2 15:47:36 GMT 2010
Summary: I suggest adding a rule to remove the suffix "acy".
This will improve the Porter stemmer for English on about 60 words, some of
them quite common.
Almost all the words ending with acy (but not ending with cracy) correspond
to a word that ends "ate".
For example, accuracy and accurate would both have the stem "accur".
Other words to benefit from this change are adequacy, advocacy, celibacy,
intimacy, intricacy, etc.
Making this change would require the stemmer to also remove the suffix
"acies".
As usual the words must be long enough, by "measure" of quasi syllables, for
the rule to apply.
Can the Snowball language handle a rule that applies to words ending acy but
not cracy?
Details:
In yawl.lst there are 133 words ending -acy. Excluding the 44 that end
-cracy leaves 89.
68 of those 89 do have a word that ends with -ate.
There are 21 words that do not have a matching -ate word.
However, in many of these cases it is still good to remove -acy, producing
e.g. benign, conspir, procur, prolific, suprem.
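The counts above could be reproduced with something like the following Python sketch. It is only an illustration: the function name tally_acy and the tiny sample list are made up here, and the sample is not yawl.lst, so its numbers say nothing about the real word list.

```python
def tally_acy(words):
    """Group -acy words by whether a matching -ate word is in the same list."""
    vocab = set(w.strip().lower() for w in words)
    acy = sorted(w for w in vocab if w.endswith("acy"))
    cracy = [w for w in acy if w.endswith("cracy")]
    candidates = [w for w in acy if not w.endswith("cracy")]
    # Swap the final "acy" for "ate" and look the result up in the list.
    with_ate = [w for w in candidates if w[:-3] + "ate" in vocab]
    without_ate = [w for w in candidates if w[:-3] + "ate" not in vocab]
    return {
        "acy": acy,
        "cracy": cracy,
        "with_ate": with_ate,
        "without_ate": without_ate,
    }


# Illustrative sample only (hypothetical, not taken from yawl.lst).
sample = [
    "accuracy", "accurate", "adequacy", "adequate", "democracy",
    "supremacy", "conspiracy", "piracy", "pirate",
]
result = tally_acy(sample)
for name, group in result.items():
    print(name, len(group), group)
```

To run it against a real list, pass the lines of the file instead of the sample, e.g. tally_acy(open(path)) where path points at a one-word-per-line file.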
Removing the suffix acy is good in almost every case where it would apply,
and never creates a bad conflation.
It does miss a few good conflations.
We would not stem some of those words that are too short (by the Porter
"measure"), e.g., lacy, racy, spacy, etc.
Because of this requirement the rule will miss some good conflations,
e.g. curacy and curate, legacy and legate, piracy and pirate, etc.
But the measure requirement lets us avoid some bad conflations,
especially fallacy -> fall.
We also avoid some other bad conflations (bad based on meaning, even if
OK on etymology),
e.g., lunacy and lunate, primacy and primate, etc.
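The behaviour described above can be sketched in Python (not Snowball). This is a rough illustration under stated assumptions, not the actual Porter or Snowball code: the names measure and strip_acy are invented here, the threshold of measure > 1 is inferred from the examples (piracy is left alone, accuracy is stemmed) rather than stated in the proposal, and the sketch works on raw words, ignoring whatever the earlier Porter steps may already have done to the ending.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions: m in the form [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m


def strip_acy(word, min_measure=2):
    """Remove -acy / -acies unless the word ends -cracy / -cracies.

    min_measure=2 (i.e. measure > 1) is an assumption inferred from the
    examples in this message, not a value given in the proposal.
    """
    for suffix, guard in (("acies", "cracies"), ("acy", "cracy")):
        if word.endswith(suffix):
            if word.endswith(guard):
                return word  # leave democracy, democracies, etc. alone
            stem = word[: -len(suffix)]
            return stem if measure(stem) >= min_measure else word
    return word


for w in ["accuracy", "intricacies", "supremacy", "democracy",
          "piracy", "fallacy", "lunacy", "lacy"]:
    print(w, "->", strip_acy(w))
```

With this threshold the first three words come out as accur, intric and suprem, and the remaining five are returned unchanged, which matches the cases listed above.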
I know more about stemming than about the Snowball language. (I use Perl.)
Can the Snowball language handle the rule of doing -acy but not -cracy?
Thanks,
Steve Tolkin