[Snowball-discuss] a simple algorithm problem
Martin Porter
martin.porter at grapeshot.co.uk
Mon Dec 13 17:41:23 GMT 2004
ayhan,
Well, your example word is
asaiit*lard*a
where * is the two-byte sequence C4 B1 (hex),
or (110)00100 (10)110001 (binary),
which is the UTF-8 encoding of 00100110001 (binary) or 131 (hex), which is
the Unicode character for a dotless i.
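To check the arithmetic, here is a small Python sketch that strips the UTF-8 marker bits from C4 B1 by hand and joins what is left. The function name decode_two_byte_utf8 is made up for this illustration and it only handles the two-byte case.

```python
# Decode a two-byte UTF-8 sequence by hand (illustrative helper, not part of Snowball).
raw = bytes([0xC4, 0xB1])

def decode_two_byte_utf8(b):
    lead, cont = b[0], b[1]
    # A two-byte sequence has the form 110xxxxx 10xxxxxx.
    assert lead & 0b11100000 == 0b11000000, "lead byte must start with 110"
    assert cont & 0b11000000 == 0b10000000, "continuation byte must start with 10"
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
    return ((lead & 0b00011111) << 6) | (cont & 0b00111111)

code_point = decode_two_byte_utf8(raw)
print(hex(code_point))                     # code point as hex
print(format(code_point, "011b"))          # the 11 payload bits
print(raw.decode("utf-8") == "\u0131")     # compare with Python's own decoder
```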
In other words, you think of it as one character, which in Unicode it is,
but Snowball thinks it is two characters, because it occupies two bytes.
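The mismatch is easy to see by counting the same word both ways. This Python sketch assumes the word is the one above with each * replaced by a dotless i; it only illustrates the character/byte difference and says nothing about Snowball internals.

```python
# The example word, with the dotless i written as the escape \u0131.
word = "asaiit" + "\u0131" + "lard" + "\u0131" + "a"

n_chars = len(word)                    # Unicode characters
n_bytes = len(word.encode("utf-8"))    # bytes a byte-oriented program sees

# Each dotless i costs one extra byte in UTF-8.
extra = n_bytes - n_chars
print(n_chars, n_bytes, extra)
```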
You can run Snowball in 16-bit character mode and so represent the Turkish
alphabet in Unicode. But the special characters you are defining suggest
that you might be trying to get the stemmer working in 8-bit ASCII with
ISO Latin-1 extensions.
My inclination would be to get it going as an 8-bit per character program
and worry about Unicode later.
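As a rough illustration of the 8-bit route, the Python sketch below encodes the same word with one byte per character. It assumes ISO-8859-9 (Latin-5, the Turkish variant of Latin-1) as the 8-bit encoding, which is my choice for the example and not necessarily what you are using; plain ISO Latin-1 has no slot for the dotless i.

```python
word = "asaiit" + "\u0131" + "lard" + "\u0131" + "a"

# Assumed 8-bit encoding: ISO-8859-9, where the dotless i is the single byte FD.
as_latin5 = word.encode("iso8859_9")
dotless_i_byte = "\u0131".encode("iso8859_9")

# Plain ISO Latin-1 cannot represent the dotless i at all.
try:
    "\u0131".encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

print(len(word), len(as_latin5), dotless_i_byte.hex(), fits_latin1)
```

With one byte per character, the byte count and the character count agree, which is what a byte-oriented stemmer needs.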
Martin