[Snowball-discuss] 16 bit characters in Snowball

Richard Boulton richard@tartarus.org
25 May 2002 14:37:44 +0100


On Fri, 2002-05-24 at 20:47, Andreas Jung wrote:
> Seems that the problem is still not solved.
> I re-created all stemmers with and without -w option and in
> both cases snowball produced identical sources. Any ideas why?

Yes, -w doesn't change the output.  What it does is allow snowball
programs to use character values in the range 0-65535 instead of 0-255.

A snowball program which can be generated successfully without -w will
not be affected by use of -w.  However, a snowball program which uses
characters outside the range 0-255 will not be generated successfully
without -w.

If you're using -w to generate snowball output, you must also set
the typedef of "symbol" in api.h to a type wide enough to hold those
values (at least 16 bits) when you compile the sources: see the comment
at the start of api.h.
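
As a rough sketch of what such a typedef change might look like (the
macro name EXAMPLE_WIDE_SYMBOLS and the helper symbol_max are made up
for illustration and are not part of snowball; the comment in api.h is
the authority on what to actually change):

```c
/* Hypothetical switch, for illustration only: define it when the
 * sources were generated with -w. */
#ifdef EXAMPLE_WIDE_SYMBOLS
typedef unsigned short symbol;   /* at least 16 bits: holds 0-65535 */
#else
typedef unsigned char symbol;    /* at least 8 bits: holds 0-255 */
#endif

/* Largest character value a symbol can hold on this platform.
 * Converting -1 to an unsigned type yields that type's maximum. */
static unsigned long symbol_max(void)
{
    symbol all_ones = (symbol)-1;
    return (unsigned long)all_ones;
}
```

If symbol_max() comes out below 65535, the typedef is too narrow for a
program generated with -w that really uses characters above 255.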

Note that using -w and setting the size of symbol still doesn't
guarantee that the snowball program is using a 16 bit character set.
See the russian/stem.sbl file for an example: by default it uses KOI8-R
(in which all the character codes fit in one byte), but if you swap
which lines are commented out you can make it use Unicode instead.
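
To illustrate the difference between the two encodings: the Cyrillic
small letter "a" is 0xC1 in KOI8-R and U+0430 in Unicode.  A minimal C
sketch follows (the helper needs_w is hypothetical and only mirrors the
0-255 rule described above; it is not part of snowball):

```c
/* Code for the Cyrillic small letter "a" in each encoding. */
#define KOI8R_CYRILLIC_A   0xC1u    /* 193: fits in one byte */
#define UNICODE_CYRILLIC_A 0x0430u  /* 1072: does not fit in one byte */

/* Hypothetical helper: returns 1 if a character value lies outside
 * 0-255, i.e. a program using it could not be generated without -w. */
static int needs_w(unsigned long code)
{
    return code > 255ul;
}
```

Here needs_w(KOI8R_CYRILLIC_A) is 0 and needs_w(UNICODE_CYRILLIC_A) is
1, which is why only the Unicode variant of the stemmer depends on -w.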

-- 
Richard

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/snowball-discuss