[Snowball-discuss] 8-bit and 16-bit characters support

Martin Porter martin_porter@SoftHome.net
Wed Jun 4 07:02:04 2003


Oleg,

No, Snowball is set up either for 1-byte character use or for 2-byte character
use, but it has occurred to me that implementing the stemmers on utf-8 data
may not be so difficult, even with no changes to the Snowball compiler.

If you treat utf-8 data as a pure byte stream (so one non-ASCII character
corresponds to 2 or more bytes), the stemmers almost work. The thing that
goes wrong is the single-character test for membership of a certain
character class. So one would have to replace

    goto vowel  // vowel defined by: define vowel '...'

by

    goto among ('a' 'e' 'i' 'o' 'u')

or more precisely

    goto among ('[a]' '[e]' ... )

where [a] etc. are macros defining the vowels as utf-8 encoded byte sequences.
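
As a rough illustration of why the class test fails on a byte stream while
matching whole byte sequences works, here is a small Python sketch. It is
not Snowball and not generated stemmer code: the helper names, the vowel
list and the sample word are all assumptions made up for the example.

```python
# Illustrative sketch only: the helper names and the sample word are
# assumptions, not part of Snowball or its generated code.

# Russian vowels, each encoded as a UTF-8 byte sequence (2 bytes apiece).
VOWELS = [ch.encode("utf-8") for ch in "аеиоуыэюя"]

# A byte-at-a-time "character class" can at best hold the lead bytes.
VOWEL_LEAD_BYTES = {v[0] for v in VOWELS}


def goto_class_bytewise(data: bytes, start: int) -> int:
    """Single-byte class test: stop at the first byte found in the class."""
    for pos in range(start, len(data)):
        if data[pos] in VOWEL_LEAD_BYTES:
            return pos
    return -1


def goto_among(data: bytes, start: int) -> int:
    """Rough analogue of goto among(...): stop where a whole sequence matches."""
    for pos in range(start, len(data)):
        if any(data.startswith(v, pos) for v in VOWELS):
            return pos
    return -1


word = "мир".encode("utf-8")         # bytes: D0 BC, D0 B8, D1 80
print(goto_class_bytewise(word, 0))  # 0: stops on the consonant's lead byte
print(goto_among(word, 0))           # 2: byte offset of the vowel
```

The byte-wise test stops too early because vowels and consonants share the
same lead bytes. Scanning byte by byte for whole sequences is safe here
because a complete utf-8 sequence cannot start in the middle of another
character.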

Perhaps that is how all the stemmers should have been written.

Can you point me to some plain text somewhere on the web that gives a bit of
Russian in utf-8 encoded Unicode? I might play around with this idea.

Martin