[Snowball-discuss] Unicode & the snowball stemming algorithm

Martin Porter martin.f.porter at gmail.com
Wed Dec 10 17:40:39 GMT 2014


Griz,

> Can't Unicode characters be used instead of o" and u"?

Yes, they can. When I started writing the stemmers, I was careful to
restrict the character set in which they were written to pure ASCII.
This was before the days of Unicode, when there was such a variety of
representations of the Cyrillic letters that, unless I'd represented
them in transliterated form in the Roman alphabet, it would have led
to confusion. The Russian stemmer was done very early on,
and I stuck with the same convention in doing the others. Evren, I
think, just followed that convention.

I've not included non-ASCII characters in any of my own snowball
scripts, but I've seen others using them, without any apparent
difficulty.
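
As a rough illustration only (a Python sketch, not taken from any
stemmer source; the table ASCII_TO_UNICODE and the function to_unicode
are hypothetical names), the ASCII notation can be read as a name for
a single Unicode character:

```python
import unicodedata

# Hypothetical table: ASCII notation -> the Unicode letter it stands for.
# Only the two forms mentioned in this thread are listed.
ASCII_TO_UNICODE = {
    'o"': "\u00f6",  # LATIN SMALL LETTER O WITH DIAERESIS
    'u"': "\u00fc",  # LATIN SMALL LETTER U WITH DIAERESIS
}


def to_unicode(text: str) -> str:
    """Replace each ASCII digraph with its single Unicode character."""
    for ascii_form, letter in ASCII_TO_UNICODE.items():
        text = text.replace(ascii_form, letter)
    # Normalise to NFC so that a base letter followed by a combining
    # diaeresis becomes the same precomposed character used above.
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(to_unicode('go"z'), to_unicode('u"zu"m'))
```

The NFC step is an assumption about the input: it matters only if the
text might contain decomposed forms of these letters.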

Good luck with the Azerbaijani stemmer!

Martin

On 12/10/14, Griz <grizzly.kenges at gmail.com> wrote:
> Dear Martin,
>
> Hello, my name is Ken Keyes, and I am attempting to write a tokenizer
> based on Evren (Kapusuz) Çilden's work:
> http://snowball.tartarus.org/algorithms/turkish/stemmer.html for
> Azerbaijani, a close relative of Turkish.
>
> I notice that Evren uses this notation o" for ö, and u" for ü.
>
> Can't Unicode characters be used instead of o" and u"?
>
> Many thanks in advance for answering,
>
> Ken
>


