[Snowball-discuss] ANSI C Generator feature

Marios Sintichakis mms at archetypon.com
Fri Jul 15 13:32:50 BST 2005


Hello, 
 
Consider the following snowball program (test.sbl).
 
stringescapes {}
externals ( main )
stringdef   a   hex   '03b1'  // Greek Small Letter Alpha
stringdef   i`  hex   '03af'  // Greek Small Letter Iota With Tonos
define main as '{a}{i`}'
 
When test.sbl is compiled with
 
snowball test.sbl -u -o test
 
the compiler generates
 
...
static symbol s_0[] = { 0xCE, 0xB1, 0xCE, 0xAF };
...
 
as expected (the byte sequence 0xCE 0xB1 is Greek Small Letter Alpha in UTF-8).
However, when compiled with
 
snowball test.sbl -w -o test
 
the generated code reads
 
...
static symbol s_0[] = { 0xB1, 0xAF };
...
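
Both outputs can be reproduced by hand. Below is a minimal sketch in C, not
taken from the Snowball sources: utf8_encode and low_byte are hypothetical
helper names introduced only for illustration. The first applies the standard
UTF-8 encoding rules to a code point; the second keeps only the low 8 bits,
which happens to match the values seen with -w (0x03B1 -> 0xB1, 0x03AF -> 0xAF).
Whether the generator truncates in exactly this way is an assumption based on
the output alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   `out` must have room for at least 4 bytes.
   Returns the number of bytes written. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80UL) {                 /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {              /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Hypothetical helper: keep only the low 8 bits of a code point. */
unsigned char low_byte(unsigned long cp)
{
    return (unsigned char)(cp & 0xFFUL);
}
```

Calling utf8_encode(0x03B1, buf) fills buf with 0xCE 0xB1, and
utf8_encode(0x03AF, buf) gives 0xCE 0xAF, matching the -u output above.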

I am running Snowball on a Win2K server. I have compiled the Snowball compiler
with (cygwin) gcc 3.4.4 as well as with Microsoft C/C++ Compiler 13.10.3077. The
results are identical in both cases.
 
The following modification to function wlitarray (line 94 of source file generator.c)
 
for (j=8*sizeof(symbol)-4; j>=0; j-=4) wh(g, ch >> j & 0x0f);
 
along with a redefinition of symbol as
 
typedef wchar_t symbol; // unsigned short works as well
 
fixed the problem: now
 
snowball test.sbl -w -o test
 
generates
 
...
static symbol s_0[] = { 0x03B1, 0x03AF };
...
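
The loop in the modification above emits one hex digit per 4-bit nibble of a
symbol, so the number of digits follows from sizeof(symbol). Here is a
standalone sketch of the same idea. It is an illustration only: format_symbol
is a hypothetical name, and it writes into a caller-supplied buffer instead of
calling the generator's wh.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write `value` as "0x" followed by one uppercase hex
   digit per nibble of a symbol that is `width` bytes wide.
   `out` must have room for 2 + 2*width + 1 characters. */
void format_symbol(unsigned long value, size_t width, char *out)
{
    static const char digits[] = "0123456789ABCDEF";
    int j;

    assert(width >= 1 && width <= sizeof(unsigned long));
    *out++ = '0';
    *out++ = 'x';
    /* Start at the most significant nibble and walk down to the least. */
    for (j = (int)(8 * width) - 4; j >= 0; j -= 4)
        *out++ = digits[(value >> j) & 0x0f];
    *out = '\0';
}
```

With width 2 (a 16-bit symbol, as wchar_t is on Windows), format_symbol(0x03B1, 2, buf)
yields the string 0x03B1. With width 1 the same value comes out as 0xB1, and
with a 4-byte symbol the loop would print eight digits.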

However, I am wondering whether UTF-8 is the preferred internal encoding for ANSI C stemmers.
 
 
Best regards,
Marios.