[Snowball-discuss] Download tarball inconsistencies
Olly Betts
olly at survex.com
Sun Sep 10 21:40:49 BST 2006
On Sun, Sep 10, 2006 at 10:26:36AM +0100, richard at lemurconsulting.com wrote:
> On Sun, Sep 10, 2006 at 04:59:24AM +0100, Olly Betts wrote:
> > I find it somewhat surprising that they don't contain exactly the same
> > set of .sbl files!
>
> They should do now.
Thanks for fixing this so quickly.
> (And the timestamps should be the same, too, not that
> they're particularly meaningful.)
It certainly doesn't hurt to have them consistent though.
> The stem.sbl files assume encoding in Latin-1 - but since for the
> characters they accept this is the same as Unicode, they can be compiled as
> Unicode algorithms using the appropriate switch to the snowball compiler
> (IIRC, -u). The encodings expected by the other stem-*.sbl files should be
> obvious.
The exception is the Russian "stem.sbl", which is KOI8-R; there,
stem-Unicode.sbl is the Unicode version (I realise you probably know
this, but I thought it worth pointing out for the benefit of others...)
Also, the Romanian stemmers both use Unicode characters outside of Latin-1.
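To illustrate the encoding point, here is a minimal Python sketch (not part of the Snowball distribution; the helper name codepoints_match_bytes is made up for this example). It checks whether every byte value in an 8-bit encoding decodes to the Unicode code point with the same number, which holds for Latin-1 but not for KOI8-R:

```python
def codepoints_match_bytes(encoding: str) -> bool:
    """Return True if every byte 0x00-0xFF decodes to the code point
    with the same numeric value in the given 8-bit encoding."""
    for b in range(256):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            # An undefined byte means the encoding cannot be an
            # identity mapping onto Unicode.
            return False
        if ord(ch) != b:
            return False
    return True


if __name__ == "__main__":
    # Uses Python's built-in codec names for the two encodings.
    for name in ("latin-1", "koi8_r"):
        print(name, codepoints_match_bytes(name))
```

This is why a Latin-1 source can plausibly be reused unchanged when compiling for Unicode, while the Russian stemmer needs a separate Unicode version of its source.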
Hmm, actually things aren't quite right still:
http://snowball.tartarus.org/dist/snowball_web_and_code.tgz
is described as:
Snowball, algorithms, and libstemmer library, and documentation
This contains all the source code for snowball (but not the generated
source files), and also the full documentation of the stemming
algorithms.
But it actually *DOES* contain generated source files (stem.c and stem.h
in each directory), and it also contains an extra .sbl file compared to
snowball_code.tgz (actually just a copy of russian/stem.sbl):
snowball_web_and_code/algorithms/russian/stem-KOI8-R.sbl
It also has stop.txt for most of the algorithm subdirectories which
snowball_code doesn't have.
Also snowball_web_and_code doesn't have libstemmer - I think it has the
older code that libstemmer replaces.
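For anyone wanting to repeat this kind of check, the following is a rough Python sketch of how the .sbl contents of two tarballs could be compared. It assumes each tarball unpacks into a single top-level directory (as snowball_web_and_code does), and the function names sbl_members and compare_sbl are invented for this example:

```python
import tarfile
from pathlib import PurePosixPath


def sbl_members(tar: tarfile.TarFile) -> set[str]:
    """Collect the paths of all .sbl files in an open tarball,
    with the top-level directory name stripped so that two
    differently named tarballs can be compared directly."""
    found = set()
    for member in tar.getmembers():
        if not member.isfile():
            continue
        parts = PurePosixPath(member.name).parts
        if len(parts) > 1 and parts[-1].endswith(".sbl"):
            found.add("/".join(parts[1:]))
    return found


def compare_sbl(path_a: str, path_b: str) -> tuple[set[str], set[str]]:
    """Return (.sbl files only in A, .sbl files only in B)."""
    with tarfile.open(path_a) as a, tarfile.open(path_b) as b:
        in_a, in_b = sbl_members(a), sbl_members(b)
    return in_a - in_b, in_b - in_a
```

With local copies of the two downloads, compare_sbl("snowball_code.tgz", "snowball_web_and_code.tgz") would return the two sets of differences; both should be empty if the tarballs carry the same set of .sbl files.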
Cheers,
Olly