[Snowball-discuss] 8-bit and 16-bit characters support

Martin Porter martin_porter@SoftHome.net
Wed Jun 4 09:20:01 2003


>Date: Wed, 04 Jun 2003 10:08:30 +0300
>From: Eugen Bushuev <bu@lucky.net>
>To: Martin Porter <martin_porter@SoftHome.net>
>CC: Oleg Bartunov <oleg@sai.msu.su>
>Subject: Re: [Snowball-discuss] 8-bit and 16-bit characters support
>References: <courier.3EDD8B20.00003D70@softhome.net>
>
>Hi.
>This question was raised by me. You can get a bit of Russian UTF-8 text at
>http://ox.carrier.kiev.ua/~bu/test/fetch/novosti_utf8.html.
>
>About your advice: I can't find anything similar to the "goto v[owel]"
>directives in either the Russian or the English .sbl. I tried adding the
>current character size to the SN_env structure and replacing sizeof(symbol)
>with z->sizeOfChar in all memory allocation procedures. I also tried
>incrementing z->c, but that got me nowhere, since I barely understand how
>it works.
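
On the cursor question above: if the buffer holds raw UTF-8 bytes, moving a
cursor forward by one character means stepping over the lead byte and then
over any continuation bytes (those with the bit pattern 10xxxxxx). The Python
sketch below illustrates only that idea. The helper names next_char_boundary
and char_offsets are hypothetical, they are not part of the Snowball runtime,
and the code assumes well-formed UTF-8 input.

```python
def next_char_boundary(buf: bytes, c: int) -> int:
    """Return the byte offset just past the UTF-8 character starting at c.

    Hypothetical helper for illustration; assumes buf is well-formed UTF-8
    and that c points at the first byte of a character.
    """
    if c >= len(buf):
        return c
    c += 1  # step over the lead byte
    # Continuation bytes always look like 10xxxxxx.
    while c < len(buf) and (buf[c] & 0xC0) == 0x80:
        c += 1
    return c


def char_offsets(buf: bytes) -> list:
    """List the byte offset at which each character of buf starts."""
    offsets = []
    c = 0
    while c < len(buf):
        offsets.append(c)
        c = next_char_boundary(buf, c)
    return offsets


word = "дом".encode("utf-8")  # three Cyrillic letters, two bytes each
print(len(word), char_offsets(word))
```

For the three-letter word in the example the buffer is six bytes long and the
characters start at offsets 0, 2 and 4, so a cursor that is incremented by one
lands in the middle of a character every other step.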
>
>I'm trying to work this out because I need tsearch to work with UTF-8.
>UTF-8 is used because Postgres uses it as "Unicode", and besides this I
>need to process data in several languages, at least English, Russian
>and Ukrainian.
>
>And, btw, why 2 or 3 bytes? I thought that English text uses 1
>byte per character and Russian 2 bytes...
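
The byte counts asked about above are easy to check directly. The short Python
sketch below encodes a few sample characters and reports how many bytes each
takes in UTF-8. The sample characters are chosen for illustration only, and
utf8_length is a hypothetical helper name.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the character ch occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Illustrative samples: a Latin letter, a Russian letter,
# a Ukrainian letter, and the euro sign.
for ch in ["a", "д", "ї", "€"]:
    print(repr(ch), utf8_length(ch), ch.encode("utf-8").hex())
```

Latin letters take 1 byte and Cyrillic letters (Russian and Ukrainian alike)
take 2 bytes, which matches the expectation in the question. Three-byte
sequences only appear for characters further up the Unicode range, such as
the euro sign.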
>
>Martin Porter wrote:
>
>>Oleg,
>>
>>No, Snowball is either set up for 1 byte character use, or 2 byte character
>>use, but it has occurred to me that implementing the stemmers on utf-8 data
>>may not be so difficult, even with no changes to the Snowball compiler.
>>
>>If you treat utf-8 data as a pure byte stream of characters (so one utf-8
>>character corresponds to 2 or 3 bytes) the stemmers almost work, but the
>>thing that goes wrong is the single character tests for characters in a
>>certain class. So one would have to replace
>>
>>    goto vowel  // vowel defined by: define vowel '...'
>>
>>by
>>
>>    goto among ('a' 'e' 'i' 'o' 'u')
>>
>>or more precisely
>>
>>    goto among ('[a]' '[e]' ... )
>>
>>where [a] etc. are macros defining the vowels as UTF-8 encoded byte sequences.
>>
>>Perhaps that is how all the stemmers should have been written.
>>
>>Can you point me to some plain text somewhere on the web that gives a bit of
>>Russian in UTF-8 encoded Unicode? I might play around with this idea.
>>
>>Martin
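
The suggestion quoted above (treat UTF-8 as a plain byte stream and match each
vowel as a byte sequence rather than as a single symbol) can be sketched in
Python as follows. This only illustrates the idea and is not Snowball compiler
output. The vowel list is an assumption made for the example, and
naive_is_vowel and goto_vowel are hypothetical helper names.

```python
# Assumed vowel set for the example (Russian vowels).
VOWELS = "аеиоуыэюя"

# Each vowel as its UTF-8 byte sequence, the analogue of the '[a]' macros.
VOWEL_SEQS = tuple(v.encode("utf-8") for v in VOWELS)

# What a single-symbol class test effectively sees on a byte stream:
# the set of individual bytes that occur in any vowel encoding.
VOWEL_BYTES = {b for seq in VOWEL_SEQS for b in seq}


def naive_is_vowel(buf: bytes, c: int) -> bool:
    """Single-byte class test. Unreliable on multi-byte characters."""
    return buf[c] in VOWEL_BYTES


def goto_vowel(buf: bytes, c: int) -> int:
    """Offset of the first vowel byte sequence at or after c, or -1.

    Scanning byte by byte is safe for well-formed UTF-8 because a complete
    character encoding can never begin on a continuation byte.
    """
    while c < len(buf):
        if buf.startswith(VOWEL_SEQS, c):
            return c
        c += 1
    return -1


word = "стол".encode("utf-8")
print(naive_is_vowel(word, 0))  # tests only the lead byte of the consonant
print(goto_vowel(word, 0))
```

In this example the single-byte test reports a vowel at offset 0, because the
consonant there shares its lead byte with several vowels, while the
byte-sequence search correctly finds the first vowel at byte offset 4.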
>
>-- 
>Best regards, E. Bushuev.