[Snowball-discuss] Sample files for Spanish and English
Olly Betts
olly at survex.com
Tue May 10 05:21:32 BST 2016
On Sat, May 07, 2016 at 12:16:48PM +0200, P.O. Jonsson wrote:
> I have noticed „gaps“ or empty lines in the Spanish samples voc.txt and
> output.txt at around these positions
>
> ES GAPS
It looks like these gaps all correspond - i.e. where there's a blank
line in spanish/voc.txt, there's a corresponding blank line in
spanish/output.txt (and that explains why the automated stemming tests
pass, since an empty input stems to an empty output).
$ grep -n '^$' spanish/*.txt
spanish/output.txt:2271:
spanish/output.txt:6580:
spanish/output.txt:10955:
spanish/output.txt:12435:
spanish/output.txt:12437:
spanish/output.txt:12439:
spanish/output.txt:14444:
spanish/output.txt:15170:
spanish/output.txt:17580:
spanish/output.txt:17582:
spanish/output.txt:18189:
spanish/output.txt:21636:
spanish/output.txt:26892:
spanish/voc.txt:2271:
spanish/voc.txt:6580:
spanish/voc.txt:10955:
spanish/voc.txt:12435:
spanish/voc.txt:12437:
spanish/voc.txt:12439:
spanish/voc.txt:14444:
spanish/voc.txt:15170:
spanish/voc.txt:17580:
spanish/voc.txt:17582:
spanish/voc.txt:18189:
spanish/voc.txt:21636:
spanish/voc.txt:26892:
Are you seeing some problem due to these blank lines?
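If you want to repeat that comparison without eyeballing the grep output, here is a minimal Python sketch. The helper names (blank_line_numbers, blanks_correspond) are made up for illustration, and the sample lists are stand-in data, not real entries from the Spanish files.

```python
def blank_line_numbers(lines):
    """Return 1-based numbers of empty lines, mirroring grep -n '^$'."""
    return [n for n, line in enumerate(lines, start=1)
            if line.rstrip("\n") == ""]


def blanks_correspond(voc_lines, output_lines):
    """True when both inputs have blank lines at exactly the same positions."""
    return blank_line_numbers(voc_lines) == blank_line_numbers(output_lines)


# Stand-in data (hypothetical). For the real check, pass the open files
# instead, e.g. open("spanish/voc.txt", encoding="utf-8").
voc = ["uno\n", "\n", "dos\n"]
out = ["un\n", "\n", "dos\n"]

print(blank_line_numbers(voc))
print(blanks_correspond(voc, out))
```

Any iterable of lines works, so the same two functions can be pointed at the real voc.txt and output.txt.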
> I also found what I think is a miss in the english sample files:
>
> Output Gap sa
>
> voc s missing
I don't see any blank entries for english/*.txt - I think you must mean
"porter" rather than "english". The "porter" algorithm indeed stems "s"
to an empty string, which seems to match the behaviour described in
Martin Porter's 1980 paper:
http://tartarus.org/~martin/PorterStemmer/def.txt
So that's a feature, not a bug. It's certainly not a helpful behaviour,
but the "porter" implementation is a historic one so it doesn't make
sense to enhance it. The "english" stemmer *is* the "enhanced porter",
and is recommended for general use over "porter".
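To see where the empty stem comes from: Step 1a in the 1980 definition strips a final "s" with no condition on what is left of the word. Below is a rough Python sketch of that one step only. It is not the Snowball source and not a full Porter implementation, and the function name porter_step_1a is just an illustrative choice.

```python
def porter_step_1a(word):
    """Apply only Step 1a of the 1980 Porter definition to a lowercase word."""
    # Rules are tried longest suffix first; the first match is applied.
    rules = (
        ("sses", "ss"),  # caresses -> caress
        ("ies", "i"),    # ponies   -> poni
        ("ss", "ss"),    # caress   -> caress
        ("s", ""),       # cats     -> cat
    )
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "caress", "cats", "s"):
    print(repr(w), "->", repr(porter_step_1a(w)))
```

For the input "s" the last rule matches the whole word, so nothing remains, and the later steps have nothing to work on. The "english" algorithm's definition avoids this by leaving words of two letters or fewer unchanged.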
> I have corrected my own versions but maybe someone with access can
> have a look and amend?
Anyone can submit a patch or pull request.
Cheers,
Olly