[Snowball-discuss] Re: Bug report?
Tue Oct 7 19:45:02 2003
Yes, I was aware of this, and should explain:
The Porter stemmer, as originally defined, reduces "s" to null, and is
implemented in the stemmer at
The version of the Porter stemmer which I distributed for many years stems
"s" however. This is because it has a couple of improvements (points of departure
from the published algorithm) which everyone has come to accept. These
are in the slightly different version of the stemmer at
and are clearly marked DEPARTURE in the comments in the ANSI C version of the
stemmer - as well as being described in the accompanying text.
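
To make the difference concrete, here is a minimal Python sketch (not the distributed ANSI C code) of Step 1a of the published algorithm, together with a guard of the kind the DEPARTURE comments describe. The function names and the `min_len` parameter are made up for illustration; the threshold used here is an assumption chosen so that a string of length 1 passes through unchanged.

```python
def step_1a(word: str) -> str:
    """Step 1a of the published Porter algorithm (plural suffixes only)."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS
    if word.endswith("s"):
        return word[:-1]  # S -> null; this is the rule that turns "s" into ""
    return word


def step_1a_guarded(word: str, min_len: int = 2) -> str:
    """Same step, but words shorter than min_len are returned untouched.

    min_len is a hypothetical parameter for this sketch, not a name taken
    from the C source.
    """
    if len(word) < min_len:
        return word
    return step_1a(word)


print(repr(step_1a("s")))          # unguarded rule: empty stem
print(repr(step_1a_guarded("s")))  # guarded: "s" is left alone
```

Without the guard, "s" is the one input for which this step yields an empty stem; with it, ordinary plurals such as "cats" are stemmed exactly as before.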
I can't alter this now, bugs or not, because of the status of the Porter stemmer
as a described algorithm, but the Snowball Porter2 stemmer fixes these and
many others besides.
I would agree that it is not helpful to stem "s" to null, but would not say that
stemming to null is invariably bad (although none of the Snowball stemmers on
current release do so). See the notes introducing the Russian stemmer.
I can't explain the problems you had with email I'm afraid. I've certainly
received executables, and files containing viruses, as unwanted attachments,
within the past few months.
> I found a phrase
> "In any case a string of length 1 will be unchanged if passed
>through the algorithm".
>Indeed, I always thought a stemmer should NOT produce empty stems, no? This
>is very inconvenient in practice since it changes file formats, word counts,
>However, it seems the algorithm does strip "s" -> "". (This is the only rule
>producing empty strings.) In effect, the program at
>http://snowball.tartarus.org/porter/stemmer.html does it; I attach the
>corresponding files (I found no way to send the executable due to the paranoid
>antivirus software at Tartarus).
>Is this correct? Wouldn't you rather change the unconditional rule
>    S ->                   cats -> cat
>to
>    (*v or *c) S ->        cats -> cat
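
One possible reading of the proposed condition, sketched in Python: "(*v or *c)" is taken here to mean that the stem left after removing the S must end in a letter, vowel or consonant, which amounts to requiring a non-empty stem. That interpretation and the function name are assumptions for illustration only; the sketch covers this single rule and assumes the earlier Step 1a rules (SSES, IES, SS) have already been tried.

```python
def strip_final_s(word: str) -> str:
    """Conditional form of the S -> null rule.

    The S is removed only when at least one letter precedes it, so the
    rule can no longer produce an empty stem.
    """
    if word.endswith("s"):
        stem = word[:-1]
        if stem and stem[-1].isalpha():  # (*v or *c): stem ends in a letter
            return stem
    return word


print(repr(strip_final_s("cats")))  # a letter precedes the S, so it is removed
print(repr(strip_final_s("s")))     # nothing precedes the S, so no change
```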