[Snowball-discuss] Out of date diffs?

J Smith jsmith@tutorbuddy.com
Wed Nov 27 00:14:01 2002


I finally got around to updating the stem-php extension for PHP (which is
obviously based on Snowball) when I noticed that two stemmed vocabulary
files seem to be out of date or something.

Specifically, the English (Porter2) files and the Norwegian files.

I ran the latest Snowball ANSI stemmers on the voc.txt files and in both
cases, the output didn't match the expected output.txt file available on
the Snowball web site.

In the case of the English stemmer, 176 words produced the wrong output.
It seems they're all words with either one or two letters, such as "a",
"ac", "ap", etc. In each case, the stemmed output is an empty string.

In the Norwegian stemmer, nearly half of the output doesn't match up at
all, with 10215 of the 20628 words failing.
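
For anyone who wants to reproduce the comparison, here is a rough sketch of
the kind of check described above, written in Python rather than PHP. It
assumes three line-aligned files: voc.txt, the expected output.txt, and a
file holding whatever the stemmer actually produced. The function names,
the third file, and the latin-1 encoding are my own assumptions, not
anything that ships with Snowball.

```python
from itertools import zip_longest


def find_mismatches(words, expected, actual):
    """Return (word, expected_stem, actual_stem) for every line that differs.

    The three sequences are assumed to be line-aligned: entry N of each
    refers to the same vocabulary word. A shorter sequence is padded with
    empty strings, so missing lines show up as mismatches.
    """
    mismatches = []
    for word, want, got in zip_longest(words, expected, actual, fillvalue=""):
        if want.strip() != got.strip():
            mismatches.append((word.strip(), want.strip(), got.strip()))
    return mismatches


def read_lines(path, encoding="latin-1"):
    """Read a file into a list of lines (the encoding is an assumption)."""
    with open(path, encoding=encoding) as handle:
        return handle.read().splitlines()


def compare_files(voc_path, expected_path, actual_path):
    """Compare the stemmer's output file against the expected output file."""
    return find_mismatches(
        read_lines(voc_path),
        read_lines(expected_path),
        read_lines(actual_path),
    )
```

Calling compare_files("voc.txt", "output.txt", "actual.txt") returns the
list of differing words, and its length is the failure count. The file
name actual.txt is only a placeholder for wherever the stemmer output was
saved.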

Is this a case of the output.txt/diff.txt files being out of date, or of
the stemmers themselves being out of date?

If anybody would like to see what I'm getting for output, I can post it to
a web site...

Cheers,

J