[Snowball-discuss] Unicode version of snowball
Martin Porter
martin.porter@grapeshot.co.uk
Thu May 13 11:00:03 2004
Xiao,
I am glad you sorted it out.
I will need to return to Snowball at some point, once I have extended some
other software to handle Unicode, and that would be a good time to address
the UTF-8 encoding business. I do not think it will be until next year, however.
It would be possible to put UTF-8 Unicode straight in, if Snowball had not
introduced these things called 'classes'. For example, the line
define v 'aeiou{a'}{e'}{i'}{o'}{u'}{u"}'
defines v as any one of the bytes in the string that follows, so each macro
such as {a'} must expand to exactly one byte. If instead one had
define v as among ('a' 'e' 'i' 'o' 'u'
'{a'}' '{e'}' '{i'}' '{o'}' '{u'}' '{u"}')
it would not matter whether the macros {a'} etc. expanded to 1, 2 or 3
bytes. The Snowball scripts could be rewritten to avoid classes altogether.
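To illustrate the difference, here is a rough Python sketch. It is not
Snowball and not how the Snowball compiler works; the names vowel_class,
in_class, vowel_among and match_among are made up for this example. It
assumes {a'} stands for a-acute encoded in UTF-8, where it takes two bytes.
A byte-based class wrongly accepts the lead byte shared with other accented
letters, while a list of alternative strings does not care how long each
alternative is.

```python
# Illustration only: Python stand-ins, not Snowball semantics.
A_ACUTE = "\u00e1".encode("utf-8")   # a-acute: two bytes in UTF-8
A_TILDE = "\u00e3".encode("utf-8")   # a-tilde: same lead byte, different second byte

# Class-style: a set of single byte values taken from the defining string.
vowel_class = set(b"aeiou" + A_ACUTE)

def in_class(buf, pos):
    """True if the single byte at pos is a member of the class."""
    return pos < len(buf) and buf[pos] in vowel_class

# Among-style: a list of whole byte strings of any length.
vowel_among = [b"a", b"e", b"i", b"o", b"u", A_ACUTE]

def match_among(buf, pos):
    """Return the length of the alternative matched at pos, or 0 if none."""
    for alt in vowel_among:
        if buf.startswith(alt, pos):
            return len(alt)
    return 0

# The class sees only the lead byte, so a-tilde is a false positive.
class_accepts_tilde = in_class(A_TILDE, 0)
# The among-style match compares whole strings, so a-tilde is rejected
# and a-acute is matched as one two-byte unit.
among_len_tilde = match_among(A_TILDE, 0)
among_len_acute = match_among(A_ACUTE, 0)
```

With single-byte encodings the two approaches agree, which is why the
class form works for the existing scripts.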
In retrospect, one can say that the idea of character classes was a
mistake: the speed they offer should instead be an optimisation that is
applied to among expressions of a certain shape.
Martin
> Martin,
>
>Thanks for your help, I can compile the snowball for stemming UCS2-based
>russian text.
>
>It is better that snowball can stem UTF8-based text, Could you have any
>plan to modify the snowball to support stemming UTF8-based text?
>
>Xiao Shibin
>TRS Ltd.
>