[Snowball-discuss] Using UTF-16 with libstemmer_c

Hai Zaar haizaar at gmail.com
Mon Apr 14 20:45:00 BST 2008


On Mon, Apr 14, 2008 at 12:04 PM, Martin Porter
<martin.porter at grapeshot.co.uk> wrote:
>
>
>  Ah, I think that is okay so long as you use the -w option in compiling
>  snowball. You then get 16 bits per char. It is not strictly UTF-16,
And what encoding should I tell snowball to work with?
This trick will work for ASCII characters only, since ASCII maps
1-to-1 onto Unicode character values.
But it would certainly fail on characters beyond ASCII.
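
For reference, here is a small sketch (an illustration only, not snowball
code) of how Unicode code points turn into 16-bit units under UTF-16. The
function name to_utf16 is made up for this example. It shows only the
encoding side and says nothing about what a -w build of snowball does
with those units.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// to_utf16 is a hypothetical helper name used only for this illustration.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    // Precondition: a valid scalar value (in range, not a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        // Code points below 0x10000 fit in a single 16-bit unit.
        return { static_cast<std::uint16_t>(cp) };
    }

    // Code points from 0x10000 upward need a surrogate pair (two units).
    cp -= 0x10000;
    return { static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
             static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)) };
}
```

Calling to_utf16(0x61) gives one unit, while to_utf16(0x10400) gives two,
which is the case a plain 16-bits-per-char scheme does not look after.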

>  since no attempt is made to look after characters greater than 0x10000,
>  but I don't think there should be any problems.
>
>  (I'm not sure how the -w option fits with use of libstemmer, and am
>  hoping Richard Boulton will add an answer.)
>
>
>  Martin
>
>
>
>
>  On Sun, 2008-04-13 at 22:26 +0300, Hai Zaar wrote:
>  > On Sun, Apr 13, 2008 at 10:22 PM, Martin Porter
>  > <martin.porter at grapeshot.co.uk> wrote:
>  > >
>  > >  Hai Zaar,
>  > >
>  > >  No, we don't deal with UTF-16 currently. Is that a problem to you?
>  > Yes, it is. In my application (C++) I work with strings using ICU
>  > library, which holds all strings in UTF-16 format. That means that in
>  > order to stem a string, I have to convert it to UTF-8, pass to
>  > stemmer, and then convert the stemmed result back to UTF-16. This
>  > looks like a significant overhead.
>  >
>
>
>
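
The round trip described in the quoted message (UTF-16 to UTF-8, stem,
then back to UTF-16) can be sketched roughly as below. This is only a
sketch under assumptions: the conversions are hand-written so the example
stands alone (a real ICU application would presumably use ICU's own
converters), the input is assumed to be well formed, and stem_utf8 is a
hypothetical callback standing in for a stemmer that works on UTF-8.
All function names here are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// UTF-16 -> UTF-8. No validation: an unpaired surrogate is passed through as-is.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// UTF-8 -> UTF-16. Assumes well-formed UTF-8; malformed input is not detected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i++]);
        std::uint32_t cp;
        int extra;  // number of continuation bytes that follow the lead byte
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else                         { cp = b & 0x07; extra = 3; }
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// The round trip: two conversions and two temporary strings per word.
// stem_utf8 is a hypothetical stand-in for the real UTF-8 stemmer call.
std::u16string stem_utf16(
    const std::u16string& word,
    const std::function<std::string(const std::string&)>& stem_utf8) {
    return utf8_to_utf16(stem_utf8(utf16_to_utf8(word)));
}
```

Every stemmed word pays for both conversions plus the temporary buffers,
which is the overhead in question.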



-- 
Zaar


