[Snowball-discuss] UTF-8 support

Martin Porter martin.porter at grapeshot.co.uk
Mon May 23 16:02:47 BST 2005


At last, some developments in Snowball!

I've put in a switch, -u or -utf8, to generate stemmers that handle UTF-8
encoded Unicode characters. Full documentation will follow, although from
the outside little is different. The ISO-Latin-1 sources of the Roman
alphabet stemmers are the same; the Russian stemmer has a stem-Unicode.sbl
variant.

Some of the stemmers needed small adjustments. If p marks a position in the
string,

    ... setmark p ...

the old test

    $p > 3

to see if p is beyond the first three characters no longer applies, since
the number in p is a byte offset from the start of the string, not a
character offset. Instead you need something like

    ... hop 3 setmark x ...
    ... setmark p ...

and later

    $p > x

So marks should be tested relative to other marks, and not against absolute
numeric values. 'size' still measures the byte size of a string, not the
character size.

The same sources can be used to generate stemmers for both UTF-8 and
ISO-Latin-1 encoded text, so long as code values are defined in hex, e.g.

    stringdef a^   hex 'E2'  // a-circumflex

but obviously if UTF-8 sequences occur inside literal strings in the
snowball source scripts, you can only use them to generate stemmers for
UTF-8 encoded text.
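
As a rough illustration of why hex code values are portable and literal UTF-8
sequences are not, the Python sketch below (again not Snowball) takes
a-circumflex, which is code value hex E2 in both ISO-Latin-1 and Unicode, and
encodes it both ways. The variable names are made up for the example.

```python
# Illustration only: one code value, two byte encodings.
code_point = 0xE2                        # a-circumflex in ISO-Latin-1 and Unicode
char = chr(code_point)

latin1_bytes = char.encode("latin-1")    # one byte:  E2
utf8_bytes = char.encode("utf-8")        # two bytes: C3 A2

# Bytes already written out as UTF-8 do not survive being read as
# ISO-Latin-1: the two bytes come back as two unrelated characters.
misread = utf8_bytes.decode("latin-1")
```

A code value leaves the choice of encoding to the generator, whereas a
literal string has already made that choice.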
