[Snowball-discuss] 8-bit and 16-bit characters support

Martin Porter martin_porter@SoftHome.net
Wed Jun 4 09:21:02 2003


Eugen,

I was really thinking aloud. I would need to rewrite the snowball scripts to
use 'among's rather than character groups. 'goto vowel' was just a way of
illustrating the problem.

The way to make it work with UTF-8 encoded data is to convert the Unicode
Russian characters to 2-byte form before calling Snowball, and then repack
them as UTF-8 afterwards. Tedious, I know.
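
A rough sketch of that round trip, written in Python purely for illustration
(it is not part of Snowball). The names utf8_to_units, units_to_utf8 and
stem_units are hypothetical; stem_units is only a stand-in for whatever
stemmer call would operate on the 2-byte values.

```python
from typing import List


def utf8_to_units(data: bytes) -> List[int]:
    """Unpack UTF-8 bytes into one integer per character (2-byte form)."""
    units = [ord(ch) for ch in data.decode("utf-8")]
    # Assumption: every character must fit in 16 bits, as Russian text does.
    if any(u > 0xFFFF for u in units):
        raise ValueError("character does not fit in 2 bytes")
    return units


def units_to_utf8(units: List[int]) -> bytes:
    """Repack a list of 2-byte character values as UTF-8 bytes."""
    return "".join(chr(u) for u in units).encode("utf-8")


def stem_units(units: List[int]) -> List[int]:
    """Hypothetical placeholder for the stemmer; returns its input unchanged."""
    return list(units)


word = "книга".encode("utf-8")   # 5 Cyrillic letters, 10 bytes in UTF-8
units = utf8_to_units(word)      # 5 values, one per letter
result = units_to_utf8(stem_units(units))
```

The point of the intermediate form is that each Russian letter becomes a
single value, so a character group can test it directly instead of having
to match a byte pair.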

I said 2 or 3 byte characters because in UTF-8, a character value above 127
packs into either 2 or 3 bytes. Is that not so?
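
A quick way to check the packing, again as an illustrative Python sketch
(utf8_length is a hypothetical helper). For values that fit in 16 bits the
2-or-3-byte statement holds: 128 to 2047 take 2 bytes and 2048 to 65535 take
3 bytes. Values above 65535 take 4 bytes, but Russian letters sit well inside
the 2-byte range.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses to encode one character value."""
    return len(chr(code_point).encode("utf-8"))


# Boundary values for each encoded length.
lengths = {
    0x7F: utf8_length(0x7F),        # last 1-byte value (127)
    0x80: utf8_length(0x80),        # first 2-byte value (128)
    0x7FF: utf8_length(0x7FF),      # last 2-byte value (2047)
    0x800: utf8_length(0x800),      # first 3-byte value (2048)
    0xFFFF: utf8_length(0xFFFF),    # last 3-byte value (65535)
    0x10000: utf8_length(0x10000),  # first 4-byte value
}

# Cyrillic small letter ya, U+044F, as an example Russian character.
ya_length = utf8_length(ord("я"))
```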

I will look at the http address you sent.

Martin