[Snowball-discuss] 16 bit characters in Snowball
Andreas Jung
andreas@andreas-jung.com
Sat, 25 May 2002 12:58:42 -0400
Why does Snowball return the result as ASCII when I pass in a UTF-16 string? My test program:
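#include <stdio.h>
#include "stem.h"
#include "header.h"

int main(int argc, char **argv) {
    int i;
    struct SN_env *z;
    /* "aargauer" as UTF-16LE: each character followed by a zero high byte */
    char b[16] = {'a',0, 'a',0, 'r',0, 'g',0,
                  'a',0, 'u',0, 'e',0, 'r',0 };

    z = german_create_env();
    /* length is 8 symbols, not 16 bytes; the cast assumes a little-endian host */
    SN_set_current(z, 8, (unsigned short *)b);
    german_stem(z);
    printf("%d\n", z->l);  /* length of the result, in symbols */
    for (i = 0; i < z->l; i++) printf("%d %c\n", z->p[i], z->p[i]);
    german_close_env(z);
    return 0;
}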
Output:
yetix@/develop/REPOSITORY/snowball/website/german(80)% ./a.out
6
97 a
97 a
114 r
103 g
97 a
117 u
symbol is defined in api.h as unsigned short.
Andreas
----- Original Message -----
From: "Richard Boulton" <richard@tartarus.org>
To: "Andreas Jung" <andreas@zope.com>
Cc: "Snowball discussion list" <snowball-discuss@lists.sourceforge.net>
Sent: Saturday, May 25, 2002 09:37
Subject: Re: [Snowball-discuss] 16 bit characters in Snowball
> On Fri, 2002-05-24 at 20:47, Andreas Jung wrote:
> > Seems that the problem is still not solved.
> > I re-created all stemmers with and without -w option and in
> > both cases snowball produced identical sources. Any ideas why?
>
> Yes, -w doesn't change the output. What it does is allow snowball
> programs to use character values in the range 0-65535 instead of 0-255.
>
> A snowball program which can be generated successfully without -w will
> not be affected by use of -w. However, a snowball program which uses
> characters out of the range 0-255 will not be generated successfully
> without -w.
>
> If you're using -w to generate snowball output, you must also set
> the typedef of "symbol" in api.h to something appropriate when you
> compile the sources: see the comment at the start of api.h.
>
> Note that using -w and setting the size of symbol still doesn't
> guarantee that the snowball program is using a 16 bit character set. See
> the russian/stem.sbl file for an example: by default it uses KOI8-R (in
> which all the character codes fit in one byte), but if you change the
> comments around you can make it use Unicode instead.
>
> --
> Richard
>
> _______________________________________________________________
>
> Don't miss the 2002 Sprint PCS Application Developer's Conference
> August 25-28 in Las Vegas -- http://devcon.sprintpcs.com/adp/index.cfm
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/snowball-discuss
>