[Snowball-discuss] Stemming 'communing' and 'communed'

Michael Edwards mbedwards at gmail.com
Thu Mar 29 11:28:54 BST 2007


On 3/29/07, Martin Porter <martin.porter at grapeshot.co.uk> wrote:
> > ... my algorithm stems it to "commun". I have run through the spec
> > 'by-hand' many times and cannot figure out how to get to the proper
> > stemming.
> >
> The reason is that prefix 'commun' is handled specially by Porter2 (see
> the 'mark_regions' routine) so that in effect it is treated as one
> syllable, rather than two syllables. So 'communing' behaves like
> 'tuning' etc. Similarly Porter2 stems 'communism' to 'communism' while
> Porter stems 'communism' to 'commun'.
>
> Were you thinking of contributing your PHP version to
>
> http://snowball.tartarus.org/otherlangs/index.html

Thanks for the reply!

I'm definitely planning to contribute the PHP version to the community
once I am confident it performs well in a production setting.

I currently have 'gener', 'commun', and 'arsen' as the exceptions you
reference. If I understand you correctly, you are saying that I should
always treat these exceptional prefixes as short syllables? That is not
clear to me from the spec's definitions of short syllables and short
words. Rather, the spec reads as though the only difference is in the
setting of R1, which is not intrinsically linked to either definition.
So I am looking for a little more clarification, so that I can
future-proof my code against any additional exceptional prefixes that
may be added down the road.
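
To make my current reading concrete, here is a minimal sketch (in Python
rather than PHP, purely for brevity). It assumes the special prefixes only
change where R1 starts and that the short-syllable test itself is left
untouched. The function names are my own, not the spec's, and the sketch
ignores the spec's marking of some 'y' letters as consonants.

```python
VOWELS = set("aeiouy")
# Prefixes I currently treat as exceptions when setting R1.
SPECIAL_PREFIXES = ("gener", "commun", "arsen")


def region_after_vc(word, start=0):
    """Index just past the first non-vowel that follows a vowel,
    searching word[start:]; len(word) if there is no such non-vowel."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_start(word):
    """Start of R1: the remainder after a special prefix, else the plain rule."""
    for prefix in SPECIAL_PREFIXES:
        if word.startswith(prefix):
            return len(prefix)
    return region_after_vc(word)


def ends_in_short_syllable(stem):
    """Short syllable at the end of stem (simplified: no 'Y' marking)."""
    if len(stem) == 2:
        # A vowel at the beginning of the word followed by a non-vowel.
        return stem[0] in VOWELS and stem[1] not in VOWELS
    if len(stem) >= 3:
        c1, v, c2 = stem[-3], stem[-2], stem[-1]
        return (c1 not in VOWELS and v in VOWELS
                and c2 not in VOWELS and c2 not in "wx")
    return False


def is_short_word(stem, r1):
    """Short word: ends in a short syllable and R1 is null (empty)."""
    return ends_in_short_syllable(stem) and r1 >= len(stem)


if __name__ == "__main__":
    word, stem = "communing", "commun"     # stem after removing 'ing'
    print(r1_start(word))                  # R1 start with the prefix exception
    print(region_after_vc(word))           # R1 start under the plain rule
    print(is_short_word(stem, r1_start(word)))
    print(is_short_word(stem, region_after_vc(word)))
```

Under these assumptions, R1 for 'communing' starts at position 6 (leaving
'ing') rather than position 3 (leaving 'muning'), exactly as for 'tuning'.
Once 'ing' is removed, 'commun' counts as a short word only in the first
case, because only then is R1 empty. If that is the intended effect, the
prefixes would influence the short-word test through R1 alone.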

Best regards,
Michael


