[Snowball-discuss] Out of date diffs?
J Smith
jsmith@tutorbuddy.com
Wed Nov 27 00:14:01 2002
I finally got around to updating the stem-php extension for PHP (which is
obviously based on Snowball) when I noticed that two stemmed vocabulary
files seem to be out of date or something.
Specifically, the English (Porter2) files and the Norwegian files.
I ran the latest Snowball ANSI stemmers on the voc.txt files and in both
cases, the output didn't match the expected output.txt file available on
the Snowball web site.
In the case of the English stemmer, 176 words produced the wrong output.
It seems they're all words with either one or two letters, such as "a",
"ac", "ap", etc. In each case, the stemmed output is an empty string.
In the Norwegian stemmer, nearly half of the output doesn't match up at
all, with 10215 of the 20628 words failing.
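
A check of this kind can be sketched in Python roughly as follows. This is
only an illustration, not the stem-php test code: find_mismatches,
read_lines and toy_stem are made-up names, the file encoding is an
assumption, and toy_stem is a stand-in that merely imitates the
"empty string for one- and two-letter words" symptom rather than calling
a real Snowball stemmer.

```python
from typing import Callable, List, Sequence, Tuple


def read_lines(path: str, encoding: str = "utf-8") -> List[str]:
    """Read one entry per line, e.g. from voc.txt or output.txt.

    The encoding is an assumption; adjust it to match the actual files.
    """
    with open(path, encoding=encoding) as handle:
        return [line.rstrip("\r\n") for line in handle]


def find_mismatches(
    words: Sequence[str],
    expected: Sequence[str],
    stem: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (word, expected_stem, actual_stem) for every disagreement."""
    if len(words) != len(expected):
        raise ValueError("vocabulary and expected output differ in length")
    mismatches = []
    for word, want in zip(words, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer, NOT Snowball.

    Returns an empty string for words of one or two letters (imitating
    the symptom described above) and strips a trailing "s" otherwise.
    """
    if len(word) <= 2:
        return ""
    return word[:-1] if word.endswith("s") else word


# Small in-memory example; real runs would use read_lines("voc.txt") etc.
sample_words = ["a", "ac", "cats"]
sample_expected = ["a", "ac", "cat"]
for word, want, got in find_mismatches(sample_words, sample_expected, toy_stem):
    print(f"{word!r}: expected {want!r}, got {got!r}")
```

Passing the stemmer in as a function keeps the comparison independent of
whichever binding is being tested, so the same loop could be pointed at
any stemmer that maps one word to one stem.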
Is this a case of the output.txt/diff.txt files being out of date, or the
stemmers themselves being out of date?
If anybody would like to see what I'm getting for output, I can post the
results to a web site...
Cheers,
J