[Snowball-discuss] Unicode support

Andreas Jung andreas@andreas-jung.com
Sun, 19 May 2002 11:43:55 -0400


I am currently running into problems integrating Snowball with Unicode
support into my Python bindings.

Inside Python I convert a Python unicode string to a UCS-2 string, e.g.
u'foo' gets converted to a 6-byte char string with the bytes 102,0, 111,0,
111,0.
When I call SN_set_current(env, 6, ucs2char) and then call the stemmer
function, Snowball seems to recognize only the first character and returns
a buffer of length 1. This is also true for longer strings like
u'computer': the stemmed word is only 1 byte long.
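
For reference, the conversion described above can be sketched in Python as
follows. It only shows the byte layout of the buffer and how many symbols the
same buffer holds at different symbol widths. The helper name symbol_count is
hypothetical, and which width the stemmer was actually built with is not
established in this thread.

```python
# Byte layout of u'foo' as little-endian UCS-2
# (identical to UTF-16-LE for characters below 0x10000).
word = u'foo'
ucs2 = word.encode('utf-16-le')

print(list(bytearray(ucs2)))  # byte values of the buffer


def symbol_count(buf, symbol_width):
    """Number of symbols in buf if each symbol occupies symbol_width bytes.

    Hypothetical helper for illustration; not part of Snowball.
    """
    if len(buf) % symbol_width != 0:
        raise ValueError("buffer length is not a multiple of the symbol width")
    return len(buf) // symbol_width


# Read as one-byte symbols, the buffer is 6 symbols and every second one is 0.
# Read as two-byte symbols, the same buffer is 3 symbols.
one_byte_symbols = symbol_count(ucs2, 1)
two_byte_symbols = symbol_count(ucs2, 2)
```

Whether the length passed to the stemmer should be 6 or 3 depends on how
'symbol' was defined when the stemmer was compiled.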

Am I missing something?

Andreas

----- Original Message -----
From: "Martin Porter" <martin_porter@softhome.net>
To: "Andreas Jung" <andreas@andreas-jung.com>
Cc: <snowball-discuss@lists.sourceforge.net>
Sent: Thursday, May 16, 2002 04:39
Subject: Re: [Snowball-discuss] Unicode support


> At 08:43 PM 5/15/02 -0400, Andreas Jung wrote:
> >Do you speak about 16 bit fixed encoding? I only know USC-2 that
> >fulfills this requirement. Is it that what you mean?
> >
> >Andreas
>
> Andreas,
>
> I take it you mean UCS-2, not USC-2.
>
> Yes, Snowball expects a typedef of 'symbol' to 'unsigned char' (one byte),
> or 'unsigned short' (two bytes or more), or 'unsigned long' (4 bytes or
> more) ... so 'unsigned char' can be used for UCS-1, 'unsigned short' for
> UCS-2.
>
> But of course none of the Snowball stemmers recognise Unicode characters
> above 32K, let alone 64K, so you can encode high-value characters as a
> sequence of two-byte characters, and pass them into Snowball compiled with
> 'symbol' as 'unsigned short'.
>
> This is precisely what UTF-16 does. Characters over 0xFFFF are split into
> two code units, each of which lies in a reserved range of Unicode.
> Snowball would therefore handle UTF-16 characters okay in this scheme.
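
The UTF-16 split mentioned above can be sketched in Python as follows. This
is an illustration of the standard surrogate-pair arithmetic, not code taken
from Snowball; the function name to_utf16_units is hypothetical.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point <= 0xFFFF:
        # Fits in a single 16-bit unit.
        return [code_point]
    # Above 0xFFFF: split the 20-bit offset into two 10-bit halves and
    # place each half in its reserved surrogate range.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


units = to_utf16_units(0x1D11E)  # MUSICAL SYMBOL G CLEF
```

Each returned value fits in an 'unsigned short', so the pair can be passed
through as two ordinary two-byte symbols.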
>
> Essentially, each Snowball stemmer has a fixed list of vowels, and
> anything else is assumed to be a consonant. A character above 0xFFFF would
> therefore be treated as a pair of consonants.
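
A small Python sketch of that vowel/consonant rule, under stated assumptions:
the vowel list below is made up for illustration (each real stemmer defines
its own), and classify is a hypothetical helper, not a Snowball function.

```python
# Hypothetical vowel list, stored as 16-bit symbol values.
VOWELS = frozenset(ord(c) for c in u'aeiouy')


def classify(units):
    """Label each symbol 'V' if it is in the vowel list, otherwise 'C'."""
    return ''.join('V' if u in VOWELS else 'C' for u in units)


# 'a', then the surrogate pair 0xD834 0xDD1E (one character above 0xFFFF),
# then 'b'. Neither surrogate is in the vowel list.
labels = classify([0x61, 0xD834, 0xDD1E, 0x62])
```

Both halves of the surrogate pair fall through to the "anything else" case.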
>
> It would have been possible to generate stemmers in C that use UTF-8
> directly, but (a) this would not have extended to Java, with its 16-bit
> characters and (b) the slowness of character cursor movement (currently
> implemented as a simple z->c++; or z->c--;) would probably have made the
> final stemmers worse than bearing the overhead of translating UTF-8 to and
> from UCS-2 for each call, always assuming that is what you have to do.
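
To illustrate the cursor-movement cost mentioned above, here is a Python
sketch that compares stepping over fixed-width symbols with stepping over
UTF-8 bytes. It is not Snowball's generated code; next_fixed, next_utf8 and
prev_utf8 are hypothetical names, and the input is assumed to be valid UTF-8.

```python
def next_fixed(c):
    """Fixed-width symbols: moving forward is a single increment."""
    return c + 1


def next_utf8(buf, c):
    """Advance cursor c past one UTF-8 encoded character in buf."""
    c += 1
    # Skip continuation bytes, which have the bit pattern 10xxxxxx.
    while c < len(buf) and (buf[c] & 0xC0) == 0x80:
        c += 1
    return c


def prev_utf8(buf, c):
    """Move cursor c back to the start of the previous UTF-8 character."""
    c -= 1
    while c > 0 and (buf[c] & 0xC0) == 0x80:
        c -= 1
    return c


# u'\u00ef' (i with diaeresis) takes two bytes in UTF-8, so the buffer
# is 6 bytes long for a 5-character word.
text = bytearray(u'na\u00efve'.encode('utf-8'))
```

The UTF-8 versions need a loop and a bit test per step, where the fixed-width
version needs only an increment or decrement.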
>
> Martin
>
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/snowball-discuss
>

