[Snowball-discuss] a simple algorithm problem

Martin Porter martin.porter at grapeshot.co.uk
Mon Dec 13 17:41:23 GMT 2004


ayhan,

Well, your example word is

asaiit*lard*a

where * is the two-byte sequence C4 B1 (hex),
                              or (110)00100 (10)110001 (binary)

which is the UTF-8 encoding of 00100110001 (binary) or 131 (hex), which is
the Unicode character for a dotless i.

In other words, you think of it as one character, which in Unicode it is,
but Snowball thinks it is two characters, because it occupies two bytes.
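
As a rough illustration of the arithmetic above (a sketch in Python, not
Snowball code; the function name decode_two_byte_utf8 is made up for this
example), the payload bits of the two bytes can be stitched together by
hand, and the character count compared with the byte count:

```python
def decode_two_byte_utf8(raw: bytes) -> int:
    """Decode one two-byte UTF-8 sequence (110xxxxx 10yyyyyy) to a code point."""
    lead, cont = raw
    # Check the marker bits: 110 on the lead byte, 10 on the continuation byte.
    if lead >> 5 != 0b110 or cont >> 6 != 0b10:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation.
    return ((lead & 0b00011111) << 6) | (cont & 0b00111111)


code_point = decode_two_byte_utf8(bytes([0xC4, 0xB1]))
print(hex(code_point))  # 0x131

# The example word, with * written as U+0131 (dotless i).
word = "asaiit\u0131lard\u0131a"
n_chars = len(word)                   # what a Unicode-aware reader counts
n_bytes = len(word.encode("utf-8"))   # what a byte-per-character program sees
print(n_chars, n_bytes)  # 13 15
```

The two extra bytes are the continuation bytes of the two dotless i's,
which is why a byte-oriented stemmer sees a longer word than you do.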

You can run Snowball in 16-bit character mode and so represent the Turkish
alphabet in Unicode. But the special characters you are defining suggest
that you might be trying to get the stemmer working in 8-bit ASCII with
ISO Latin-1 extensions.

My inclination would be to get it going as an 8-bit-per-character program
and worry about Unicode later.
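
To make the 8-bit route concrete, here is a small Python sketch. It
assumes ISO-8859-9 (Latin-5, the Turkish variant of Latin-1) as the 8-bit
encoding; that choice is an assumption for illustration, not something
stated in the original question. Plain ISO Latin-1 has no dotless i, so
some such variant or a private mapping would be needed.

```python
# Assumption: ISO-8859-9 is the 8-bit encoding in use.
word = "asaiit\u0131lard\u0131a"

encoded = word.encode("iso8859_9")

# One byte per character, so byte-wise processing sees 13 symbols.
print(len(word), len(encoded))  # 13 13

# In ISO-8859-9 the dotless i is the single byte FD (hex).
dotless_i_byte = encoded[6]
print(hex(dotless_i_byte))  # 0xfd
```

With an encoding like this, each special character can be defined in the
stemmer as a single byte value.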

Martin





