[Snowball-discuss] Unicode version of snowball

Martin Porter martin.porter@grapeshot.co.uk
Thu May 13 11:00:03 2004


Xiao,

I am glad you sorted it out.

I will need to return to Snowball at some point, when I have extended some
other software to handle Unicode, so that might be a good time to address
the UTF-8 encoding business. I do not think it will be until next year,
however.

It would be possible to put UTF-8 Unicode straight in, if Snowball had not
introduced these things called 'classes'. For example, the line

define v 'aeiou{a'}{e'}{i'}{o'}{u'}{u"}'

defines v as one of the bytes in the succeeding string. If instead one had

define v as among ('a' 'e' 'i' 'o' 'u' 
                   '{a'}' '{e'}' '{i'}' '{o'}' '{u'}' '{u"}')

it would not matter whether the macros {a'} etc were made up of 1, 2 or 3
bytes. The Snowball scripts could be rewritten to avoid the use of classes.
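To make the difference concrete, here is a minimal Python sketch of the two
kinds of test. It is only an illustration, not the code Snowball generates:
the helper names in_byte_class and among_at are hypothetical, and it assumes
the macros {a'} etc. stand for the accented letters á é í ó ú ü. A class
test looks at one byte, so under UTF-8 it accepts the lead byte shared by
every two-byte accented letter; an among-style test compares whole strings,
so the byte length of each alternative does not matter.

```python
# Hypothetical illustration; these helpers are not part of Snowball.
VOWELS = "aeiou\u00e1\u00e9\u00ed\u00f3\u00fa\u00fc"  # a e i o u á é í ó ú ü


def in_byte_class(data, pos, class_bytes):
    """Class-style test: is the single byte at pos a member of the class?"""
    return pos < len(data) and data[pos] in class_bytes


def among_at(data, pos, alternatives):
    """Among-style test: length of the longest alternative matching at pos,
    or 0 if none matches."""
    best = 0
    for alt in alternatives:
        if data.startswith(alt, pos) and len(alt) > best:
            best = len(alt)
    return best


latin1_class = VOWELS.encode("latin-1")               # one byte per letter
utf8_class = VOWELS.encode("utf-8")                   # accented letters: two bytes
utf8_alternatives = [ch.encode("utf-8") for ch in VOWELS]

a_acute_latin1 = "\u00e1".encode("latin-1")           # á as a single byte
a_acute_utf8 = "\u00e1".encode("utf-8")               # á as two bytes
n_tilde_utf8 = "\u00f1".encode("utf-8")               # ñ, which is not in v

# With a one-byte encoding the class test works as intended.
ok_latin1 = in_byte_class(a_acute_latin1, 0, latin1_class)

# With UTF-8 the class test wrongly accepts the first byte of ñ, because
# that lead byte also begins every accented vowel in the class.
false_positive = in_byte_class(n_tilde_utf8, 0, utf8_class)

# The among-style test is unaffected by how many bytes a letter takes.
matched_a_acute = among_at(a_acute_utf8, 0, utf8_alternatives)
matched_n_tilde = among_at(n_tilde_utf8, 0, utf8_alternatives)
```

Here ok_latin1 and false_positive both come out true, while among_at matches
all of á and rejects ñ.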

In retrospect, one can say that the idea of character classes was a
mistake: the speed gain they offer should instead be an optimisation applied
to among expressions of a certain shape.
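One way to picture such an optimisation is the Python sketch below. It
assumes that "a certain shape" means an among list whose alternatives are all
single characters; that reading, and the name compile_among, are assumptions
made for illustration and do not describe how Snowball works. When the list
has that shape the matcher becomes a set lookup, otherwise it falls back to
comparing each alternative in turn.

```python
# Hypothetical sketch; compile_among is not a Snowball function.
def compile_among(alternatives):
    """Return matcher(text, pos) giving the matched length, or 0 if no match."""
    if alternatives and all(len(alt) == 1 for alt in alternatives):
        # Special shape: every alternative is one character, so a single
        # set-membership test replaces the string comparisons.
        members = frozenset(alternatives)

        def fast(text, pos):
            return 1 if pos < len(text) and text[pos] in members else 0

        return fast

    # General shape: try the longest alternatives first.
    ordered = sorted(alternatives, key=len, reverse=True)

    def general(text, pos):
        for alt in ordered:
            if text.startswith(alt, pos):
                return len(alt)
        return 0

    return general


v = compile_among(list("aeiou") + ["\u00e1"])   # takes the fast path
suffix = compile_among(["ing", "s"])            # takes the general path
```

The caller writes the same among list in both cases; only the chosen
matcher differs.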

Martin

> Martin,
>
> Thanks for your help. I can compile Snowball for stemming UCS2-based
> Russian text.
>
> It would be better if Snowball could also stem UTF8-based text. Do you have
> any plan to modify Snowball to support stemming UTF8-based text?
>
> Xiao Shibin
> TRS Ltd.