[Snowball-discuss] Using UTF-16 with libstemmer_c
Hai Zaar
haizaar at gmail.com
Mon Apr 14 20:45:00 BST 2008
On Mon, Apr 14, 2008 at 12:04 PM, Martin Porter
<martin.porter at grapeshot.co.uk> wrote:
>
>
> Ah, I think that is okay so long as you use the -w option in compiling
> snowball. You then get 16 bits per char. It is not strictly UTF-16,
And what encoding do I tell snowball to work with?
This trick will work for ASCII characters only, since ASCII maps
1-to-1 onto Unicode code point values.
But it would certainly fail on characters beyond ASCII.
> since no attempt is made to look after characters greater than 0x10000,
> but I don't think there should be any problems.
>
> (I'm not sure how the -w option fits with use of libstemmer, and am
> hoping Richard Boulton will add an answer.)
>
>
> Martin
>
>
>
>
> On Sun, 2008-04-13 at 22:26 +0300, Hai Zaar wrote:
> > On Sun, Apr 13, 2008 at 10:22 PM, Martin Porter
> > <martin.porter at grapeshot.co.uk> wrote:
> > >
> > > Hai Zaar,
> > >
> > > No, we don't deal with UTF-16 currently. Is that a problem to you?
> > Yes, it is. In my application (C++) I work with strings using ICU
> > library, which holds all strings in UTF-16 format. That means that in
> > order to stem a string, I have to convert it to UTF-8, pass to
> > stemmer, and then convert the stemmed result back to UTF-16. This
> > looks like a significant overhead.
> >
>
>
>
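To make the round trip described above concrete, here is a minimal sketch of the UTF-16 to UTF-8 step in plain C. The helper name utf16_to_utf8 is hypothetical; it is not part of libstemmer or ICU (ICU has its own conversion routines, which would normally be preferable). The sketch assumes the input is an array of native-endian 16-bit code units. The resulting bytes are what would be handed to sb_stemmer_stem() on a stemmer created for UTF-8, and the stemmed result would need the reverse conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of libstemmer or ICU).
 * Converts in_len UTF-16 code units to UTF-8, writing at most out_cap bytes.
 * Returns the number of bytes written, or (size_t)-1 on an unpaired
 * surrogate or if the output buffer is too small. */
size_t utf16_to_utf8(const uint16_t *in, size_t in_len,
                     unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);

    while (i < in_len) {
        uint32_t cp = in[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (i >= in_len || in[i] < 0xDC00 || in[i] > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (uint32_t)(in[i++] - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;              /* stray low surrogate */
        }

        if (cp < 0x80) {
            /* ASCII: the byte value equals the code unit value. */
            if (o + 1 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            /* Two bytes: the first point where UTF-8 and UTF-16 diverge. */
            if (o + 2 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            if (o + 3 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            /* Code points at or above 0x10000 came from a surrogate pair. */
            if (o + 4 > out_cap) return (size_t)-1;
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

For an all-ASCII word the output bytes have the same values as the input code units, which is why the ASCII case looks as if it works without conversion. A word containing U+00E9, for example, becomes the two bytes 0xC3 0xA9, so the two encodings no longer line up.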
--
Zaar