[Snowball-discuss] 8-bit and 16-bit characters support

Martin Porter martin_porter@SoftHome.net
Wed Jun 4 11:58:01 2003


>It actually packs into up to six octets.

Yes, I was vaguely aware that it 'went on forever'. But I think for all the
characters that might be useful in a stemming context, i.e. in Snowball, the
original 64K holds everything (Polish L, Hungarian double acute etc etc).
And the large numbers are for Chinese ideograms and other exotica.

The reason the ranges matter is that Snowball uses bitmaps for character
sets, of size N-n bits where n and N are the smallest and largest code
values in the character set. N for Cyrillic is not too high, and in any case
N-n for Cyrillic is only about 33. But there is a general assumption that N
will not be so large that the bitmaps explode in size.
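
To make the size argument concrete, here is a minimal sketch in Python of a
range-based bitmap for a character set. It is an illustration only, not
Snowball's generated code: the class name CharBitmap and its layout are
assumptions made for the example. The sketch allocates N-n+1 bits so that
both end points of the range are covered.

```python
class CharBitmap:
    """Hypothetical set of characters stored as a bitmap over [lo, hi]."""

    def __init__(self, chars):
        codes = [ord(c) for c in chars]
        self.lo = min(codes)  # n: smallest code value in the set
        self.hi = max(codes)  # N: largest code value in the set
        nbits = self.hi - self.lo + 1
        # One bit per code value in the range, rounded up to whole bytes.
        self.bits = bytearray((nbits + 7) // 8)
        for code in codes:
            offset = code - self.lo
            self.bits[offset >> 3] |= 1 << (offset & 7)

    def __contains__(self, ch):
        code = ord(ch)
        if code < self.lo or code > self.hi:
            return False  # outside the range, so cannot be a member
        offset = code - self.lo
        return bool(self.bits[offset >> 3] & (1 << (offset & 7)))


# Russian vowels all lie in a narrow range, so the bitmap stays tiny.
vowels = CharBitmap("аеиоуыэюя")
print("о" in vowels, "б" in vowels, len(vowels.bits))

# A set mixing a Latin letter with a character beyond the first 64K
# forces the bitmap to span the whole gap between them.
wide = CharBitmap("a\U00020000")
print(len(wide.bits))
```

The cost depends on the spread N-n rather than on the number of members,
which is why a set confined to one alphabet is cheap while a set touching
the high code values is not.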