[Snowball-discuss] Sample files for Spanish and English

Olly Betts olly at survex.com
Tue May 10 05:21:32 BST 2016


On Sat, May 07, 2016 at 12:16:48PM +0200, P.O. Jonsson wrote:
> I have noticed „gaps“ or empty lines in the Spanish samples voc.txt and
> output.txt at around these positions
> 
> ES GAPS

It looks like these gaps all correspond - i.e. where there's a blank
line in spanish/voc.txt, there's a corresponding blank line in
spanish/output.txt (and that explains why the automated stemming tests
pass, since an empty input stems to an empty output).

$ grep -n '^$' spanish/*.txt
spanish/output.txt:2271:
spanish/output.txt:6580:
spanish/output.txt:10955:
spanish/output.txt:12435:
spanish/output.txt:12437:
spanish/output.txt:12439:
spanish/output.txt:14444:
spanish/output.txt:15170:
spanish/output.txt:17580:
spanish/output.txt:17582:
spanish/output.txt:18189:
spanish/output.txt:21636:
spanish/output.txt:26892:
spanish/voc.txt:2271:
spanish/voc.txt:6580:
spanish/voc.txt:10955:
spanish/voc.txt:12435:
spanish/voc.txt:12437:
spanish/voc.txt:12439:
spanish/voc.txt:14444:
spanish/voc.txt:15170:
spanish/voc.txt:17580:
spanish/voc.txt:17582:
spanish/voc.txt:18189:
spanish/voc.txt:21636:
spanish/voc.txt:26892:
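
If you'd rather check the correspondence programmatically than compare the
two lists by eye, something along these lines should do it.  This is only a
rough sketch: the function names are made up for illustration, and it
assumes the sample files are UTF-8 with "\n" line endings.

```python
def blank_line_numbers(lines):
    """Return the 1-based numbers of the empty lines, like grep -n '^$'."""
    return [n for n, line in enumerate(lines, start=1)
            if line.rstrip("\n") == ""]


def blanks_correspond(voc_lines, output_lines):
    """True if both inputs have their blank lines at the same positions."""
    return blank_line_numbers(voc_lines) == blank_line_numbers(output_lines)


def blanks_correspond_in_files(voc_path, output_path):
    """Same check, reading from two files (UTF-8 is an assumption here)."""
    with open(voc_path, encoding="utf-8") as voc, \
         open(output_path, encoding="utf-8") as out:
        return blanks_correspond(voc, out)
```

Calling blanks_correspond_in_files("spanish/voc.txt", "spanish/output.txt")
from the top of the data directory should then return True if the gaps
line up as in the grep output above.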

Are you seeing some problem due to these blank lines?

> I also found what I think is a miss in the english sample files:
> 
> Output Gap sa
> 
> voc s missing

I don't see any blank entries for english/*.txt - I think you must mean
"porter" rather than "english".  The "porter" algorithm indeed stems "s"
to an empty string, which seems to match the behaviour described in
Martin Porter's 1980 paper:

http://tartarus.org/~martin/PorterStemmer/def.txt

So that's a feature, not a bug.  It's certainly not a helpful behaviour,
but the "porter" implementation is a historic one so it doesn't make
sense to enhance it.  The "english" stemmer *is* the "enhanced porter",
and is recommended for general use over "porter".
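
To illustrate where the empty string comes from: Step 1a in the 1980 paper
removes a trailing "S" without any condition on what is left of the word.
Here's a minimal sketch of just that step (not the full algorithm, and not
the Snowball implementation; the function name is made up for this example):

```python
def porter_step_1a(word):
    """Apply only Step 1a of the 1980 Porter algorithm to a lowercase word.

    The rules are tried longest suffix first, and only the first one that
    matches is applied:
        SSES -> SS,  IES -> I,  SS -> SS,  S -> (nothing)
    """
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat, and "s" -> ""
    return word
```

With this sketch, porter_step_1a("s") returns an empty string, since the
last rule strips the only letter and nothing guards against that.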

> I have corrected my own versions but maybe someone with access can
> have a look and amend?

Anyone can submit a patch or pull request.

Cheers,
    Olly


