[Snowball-discuss] Using UTF-16 with libstemmer_c

Mon Apr 14 10:04:49 BST 2008

Ah, I think that is okay so long as you pass the -w option to the
snowball compiler when generating the stemmers. You then get 16 bits
per character. It is not strictly UTF-16, since no attempt is made to
handle characters at or above 0x10000 (the ones UTF-16 encodes as
surrogate pairs), but I don't think there should be any problems.
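
To make the "not strictly UTF-16" point concrete, here is a minimal C++
sketch of how UTF-16 encodes a single code point. The function name
to_utf16 is made up for illustration and is not part of snowball or
libstemmer. Anything below 0x10000 fits in one 16-bit unit, so a
16-bit-per-character stemmer sees it as one symbol; anything at or above
0x10000 becomes two units (a surrogate pair), which such a stemmer would
presumably treat as two unrelated symbols.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid code point (<= 0x10FFFF and not itself a surrogate).
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit, identical to the code point.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));    // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return units;
}
```

The alphabets the existing stemmers care about all sit below 0x10000,
which is why the single-unit case is the one that matters in practice.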

(I'm not sure how the -w option fits with use of libstemmer, and am
hoping Richard Boulton will add an answer.)

Martin

On Sun, 2008-04-13 at 22:26 +0300, Hai Zaar wrote:
> On Sun, Apr 13, 2008 at 10:22 PM, Martin Porter
> <martin.porter at grapeshot.co.uk> wrote:
> >
> >  Hai Zaar,
> >
> >  No, we don't deal with UTF-16 currently. Is that a problem for you?
> Yes, it is. In my application (C++) I work with strings using the ICU
> library, which holds all strings in UTF-16 format. That means that in
> order to stem a string, I have to convert it to UTF-8, pass it to the
> stemmer, and then convert the stemmed result back to UTF-16. This
> looks like a significant overhead.
>
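
For reference, the round trip described in the quoted message can be
sketched roughly as below. This is only an illustration under stated
assumptions: the conversions are hand-written in standard C++ rather than
taken from ICU, they assume well-formed input, and the stemmer is passed
in as a callback (stem_utf8, a hypothetical stand-in for whatever UTF-8
stemming call the application uses). The names utf16_to_utf8,
utf8_to_utf16 and stem_utf16 are invented for this sketch.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// UTF-16 -> UTF-8. Assumes well-formed input; a lone surrogate is passed
// through as if it were an ordinary 16-bit value.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {  // combine a surrogate pair
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// UTF-8 -> UTF-16. Assumes well-formed input; no validation is attempted.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i++]);
        std::uint32_t cp;
        int extra;  // number of continuation bytes that follow the lead byte
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else                         { cp = b & 0x07; extra = 3; }
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// The round trip: convert to UTF-8, stem, convert back. Two conversions and
// at least two temporary strings per word, which is the overhead in question.
std::u16string stem_utf16(
        const std::u16string& word,
        const std::function<std::string(const std::string&)>& stem_utf8) {
    return utf8_to_utf16(stem_utf8(utf16_to_utf8(word)));
}
```

A stemmer generated with 16-bit characters would avoid both conversions,
at the cost of the surrogate-pair limitation mentioned above.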