[Snowball-discuss] Stemming 'communing' and 'communed'

Michael Edwards mbedwards at gmail.com
Thu Mar 29 02:46:55 BST 2007


Greetings!

I am about to release the first version of my Porter2 stemming algorithm for
PHP (native PHP code, no C, no extensions). I have tested the algorithm
against the sample vocabulary word lists and am down to one error. The
sample word lists show that "communing" should stem to "commune", but my
algorithm stems it to "commun". Although it is not listed in the sample
vocabulary, "communed" is also stemmed to "commune" by the online Porter2 demo
hosted at the snowball.tartarus.org site, while my algorithm stems it to
"commun". I have run through the spec by hand many times and cannot figure out
how to get to the proper stemming.

Below is a run-through of how I am interpreting the spec to get to
'commun':

1) Begin with 'communing'
2) R1: ing (per prefix exceptions for 'gener', 'commun', 'arsen'), R2: null
3) Prelude
4) Step 0
5) Step 1a
6) Step 1b: delete 'ing', giving 'commun'.
Note: try as I might, I cannot figure out how to come away with the
conclusion that the word is short and thus I should add an 'e' to the end.
7) Step 1c
8) Step 2
9) Step 3
10) Step 4
11) Step 5
12) Postlude

Result: 'commun'
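
For reference, here is a minimal Python sketch of the two pieces the
run-through turns on: the R1/R2 regions and the short-word test in Step 1b.
It is one possible reading of the spec, not the reference implementation.
The key assumption, labeled in the code, is that R1 is a fixed offset
computed once on the original word and not recomputed after a suffix is
deleted. All function names are hypothetical, and only the plain 'ing' and
'ed' deletions are modeled (no 'eed', 'ingly' or 'edly' handling, no
Prelude). Under this reading R2 for 'communing' comes out as 'g' and not
null, and 'commun' counts as short because the fixed R1 offset now sits at
the end of the word.

```python
VOWELS = set("aeiouy")
R1_PREFIXES = ("gener", "commun", "arsen")  # prefix exceptions named in the post
DOUBLES = ("bb", "dd", "ff", "gg", "mm", "nn", "pp", "rr", "tt")


def region_start(word, start=0):
    """Offset just past the first non-vowel that follows a vowel at or after start."""
    for i in range(start + 1, len(word)):
        if word[i - 1] in VOWELS and word[i] not in VOWELS:
            return i + 1
    return len(word)  # null region at the end of the word


def r1_r2(word):
    """Start offsets of R1 and R2, computed once on the original word."""
    r1 = next((len(p) for p in R1_PREFIXES if word.startswith(p)), None)
    if r1 is None:
        r1 = region_start(word)
    return r1, region_start(word, r1)


def ends_in_short_syllable(word):
    n = len(word)
    if n == 2:
        # vowel at the beginning of the word followed by a non-vowel
        return word[0] in VOWELS and word[1] not in VOWELS
    if n >= 3:
        a, b, c = word[-3], word[-2], word[-1]
        # non-vowel, vowel, non-vowel other than w, x, Y
        return (a not in VOWELS and b in VOWELS
                and c not in VOWELS and c not in "wxY")
    return False


def is_short(word, r1):
    # ASSUMPTION: r1 is the offset from the ORIGINAL word, so R1 is null
    # whenever that offset is at or beyond the end of the shortened word.
    return r1 >= len(word) and ends_in_short_syllable(word)


def step_1b_delete(word, r1):
    """Simplified Step 1b: only the plain 'ing' and 'ed' deletions."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if not any(ch in VOWELS for ch in stem):
                return word  # preceding part must contain a vowel
            if stem.endswith(("at", "bl", "iz")):
                return stem + "e"
            if stem.endswith(DOUBLES):
                return stem[:-1]
            if is_short(stem, r1):
                return stem + "e"
            return stem
    return word


r1, r2 = r1_r2("communing")
print("communing"[r1:], "communing"[r2:])  # the R1 and R2 regions
print(step_1b_delete("communing", r1))
```

If R1 were instead recomputed on 'commun' without the prefix exception, it
would be 'mun', the word would not be short, and no 'e' would be added.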

Any thoughts or clarification would be much appreciated.

Best regards,
Michael Edwards