[Snowball-discuss] 8-bit and 16-bit characters support

Eugen Bushuev bu@cisarte.com
Wed Jun 4 13:32:01 2003



Btw, here are the Cyrillic letters in UTF-8:

stringdef a    decimal '45264'
stringdef b    decimal '45520'
stringdef v    decimal '45776'
stringdef g    decimal '46032'
stringdef d    decimal '46288'
stringdef e    decimal '46544'
stringdef zh   decimal '46800'
stringdef z    decimal '47056'
stringdef i    decimal '47312'
stringdef i`   decimal '47568'
stringdef k    decimal '47824'
stringdef l    decimal '48080'
stringdef m    decimal '48336'
stringdef n    decimal '48592'
stringdef o    decimal '48848'
stringdef p    decimal '49104'
stringdef r    decimal '32977'
stringdef s    decimal '33233'
stringdef t    decimal '33489'
stringdef u    decimal '33745'
stringdef f    decimal '34001'
stringdef kh   decimal '34257'
stringdef ts   decimal '34513'
stringdef ch   decimal '34769'
stringdef sh   decimal '35025'
stringdef shch decimal '35281'
stringdef "    decimal '35537'
stringdef y    decimal '35793'
stringdef '    decimal '36049'
stringdef e`   decimal '36305'
stringdef iu   decimal '36561'
stringdef ia   decimal '36817'
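
A note on where these numbers come from, as far as can be told from the values themselves: each one appears to be the letter's two UTF-8 bytes read as a little-endian 16-bit integer. The Python sketch below reproduces a few of them; the function names are made up for illustration and are not part of Snowball.

```python
def utf8_as_le16(ch: str) -> int:
    """Pack a character's two UTF-8 bytes into a little-endian 16-bit integer."""
    raw = ch.encode("utf-8")
    if len(raw) != 2:
        raise ValueError("expected a character that takes exactly 2 bytes in UTF-8")
    return int.from_bytes(raw, "little")


def le16_as_char(value: int) -> str:
    """Inverse of utf8_as_le16: unpack the two bytes and decode them as UTF-8."""
    return value.to_bytes(2, "little").decode("utf-8")


# Code points for the stringdef names a, r and ia.
print(utf8_as_le16("\u0430"))  # a
print(utf8_as_le16("\u0440"))  # r
print(utf8_as_le16("\u044f"))  # ia
```

Going the other way, `le16_as_char(45264)` gives back the Cyrillic letter for `a`, which is a quick way to check an entry in the table.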



Martin Porter wrote:

>>Return-Path: <bu@lucky.net>
>>Delivered-To: martin_porter@SoftHome.net
>>Date: Wed, 04 Jun 2003 10:08:30 +0300
>>From: Eugen Bushuev <bu@lucky.net>
>>User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.2) Gecko/20021120 Netscape/7.01
>>X-Accept-Language: ru, en-us
>>To: Martin Porter <martin_porter@SoftHome.net>
>>CC: Oleg Bartunov <oleg@sai.msu.su>
>>Subject: Re: [Snowball-discuss] 8-bit and 16-bit characters support
>>References: <courier.3EDD8B20.00003D70@softhome.net>
>>X-Verify-Sender: verified
>>
>>Hi.
>>I was the one who raised this question. You can get a bit of Russian UTF-8 text at 
>>http://ox.carrier.kiev.ua/~bu/test/fetch/novosti_utf8.html.
>>
>>About your advice: I can't find anything like the "goto v[owel]"
>>directives in either the Russian or the English .sbl. I tried adding the
>>current character size to the SN_env structure and replacing
>>sizeof(symbol) with z->sizeOfChar in all memory allocation procedures. I
>>also tried playing with incrementing z->c, but it got me nowhere, since I
>>barely understand how it works.
>>
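
For what "incrementing z->c" would have to mean on UTF-8 data, here is a minimal Python sketch, assuming the cursor is a byte offset into the word. It is not the Snowball runtime; `next_char_pos` is a hypothetical helper. The idea is to step one byte and then skip any UTF-8 continuation bytes (those of the form 10xxxxxx), so the cursor always lands on the start of a character.

```python
def next_char_pos(buf: bytes, c: int) -> int:
    """Return the byte offset just past the UTF-8 character that starts at buf[c]."""
    if c >= len(buf):
        return c  # already at the end, nothing to skip
    c += 1
    # Continuation bytes have the bit pattern 10xxxxxx.
    while c < len(buf) and (buf[c] & 0xC0) == 0x80:
        c += 1
    return c


# Latin "a" (1 byte), Cyrillic "ia" (2 bytes), euro sign (3 bytes).
buf = "a\u044f\u20ac".encode("utf-8")
pos = 0
while pos < len(buf):
    nxt = next_char_pos(buf, pos)
    print(pos, nxt, buf[pos:nxt])
    pos = nxt
```

A fixed character size, such as a sizeOfChar field, cannot describe this, because the width changes from one character to the next.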
>>I'm trying to figure this out because I need tsearch to work with UTF-8.
>>UTF-8 is used because Postgres uses it as "Unicode", and besides that I
>>need to process data in several languages, at least English, Russian
>>and Ukrainian.
>>
>>And, btw, why 2 and 3 bytes? I thought that English text uses 1
>>byte per character and Russian 2 bytes...
>>
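
The byte counts are easy to check. UTF-8 uses 1 byte for ASCII, 2 bytes for code points U+0080 to U+07FF (which covers Russian and Ukrainian Cyrillic) and 3 bytes for U+0800 to U+FFFF. A small Python sketch, with sample characters chosen here purely for illustration:

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "Latin a": "a",
    "Russian ia": "\u044f",
    "Ukrainian yi": "\u0457",
    "euro sign": "\u20ac",
}
for name, ch in samples.items():
    print(name, utf8_len(ch))
```

So for English, Russian and Ukrainian text the 1-byte and 2-byte cases are the ones that matter; 3 bytes only show up for characters outside those alphabets.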
>>Martin Porter wrote:
>>
>>    
>>
>>>Oleg,
>>>
>>>No, Snowball is either set up for 1 byte character use, or 2 byte character
>>>use, but it has occurred to me that implementing the stemmers on utf-8 data
>>>may not be so difficult, even with no changes to the Snowball compiler.
>>>
>>>If you treat utf-8 data as a pure byte stream of characters (so one utf-8
>>>character corresponds to 2 or 3 bytes) the stemmers almost work, but the
>>>thing that goes wrong is the single character tests for characters in a
>>>certain class. So one would have to replace
>>>
>>>   goto vowel  // vowel defined by "define vowel '...'"
>>>
>>>by
>>>
>>>   goto among ('a' 'e' 'i' 'o' 'u')
>>>
>>>or more precisely
>>>
>>>   goto among ('[a]' '[e]' ... )
>>>
>>>where [a] etc. are macros defining the vowels as UTF-8 encoded byte sequences.
>>>
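
To see why the single-character class test goes wrong on a byte stream and why matching whole sequences avoids it, here is a minimal Python sketch. It models the idea only and is not Snowball's generated code; the vowel set is an assumption (the nine Russian vowels a, e, i, o, u, y, e`, iu, ia), and all function names are hypothetical.

```python
# Assumed vowel set (a, e, i, o, u, y, e`, iu, ia), given as code points.
VOWELS = "\u0430\u0435\u0438\u043e\u0443\u044b\u044d\u044e\u044f"
VOWEL_SEQS = [v.encode("utf-8") for v in VOWELS]      # 2-byte sequences
VOWEL_BYTES = {b for seq in VOWEL_SEQS for b in seq}  # what a byte-level class sees


def byte_class_test(buf: bytes, c: int) -> bool:
    """Class test on one byte: what 'vowel' degrades to on a raw byte stream."""
    return buf[c] in VOWEL_BYTES


def among_test(buf: bytes, c: int) -> bool:
    """'among'-style test: does a whole vowel byte sequence start at offset c?"""
    return any(buf.startswith(seq, c) for seq in VOWEL_SEQS)


def goto_vowel(buf: bytes, c: int = 0) -> int:
    """Scan forward to the first offset where a vowel sequence starts, or -1."""
    while c < len(buf):
        if among_test(buf, c):
            return c
        c += 1
    return -1


word = "\u0431\u0440\u0430\u0442".encode("utf-8")  # "brat": b, r, a, t
print(byte_class_test(word, 0))  # the lead byte of "b" is shared with vowels
print(among_test(word, 0))
print(goto_vowel(word))
```

The byte-level test fires on the first byte of the consonant because Cyrillic letters share the same lead bytes, while the sequence match only succeeds at the real vowel. Scanning byte by byte is safe here because every vowel sequence begins with a lead byte, which can never appear in the middle of a character.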
>>>Perhaps that is how all the stemmers should have been written.
>>>
>>>Can you point me to some plain text somewhere on the web that gives a bit of
>>>Russian in UTF-8 encoded Unicode? I might play around with this idea.
>>>
>>>Martin
>>>
>>>
>>>
>>>
>>>_______________________________________________
>>>Snowball-discuss mailing list
>>>Snowball-discuss@lists.tartarus.org
>>>http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>>>
>>> 
>>>
>>>      
>>>
>>-- 
>>Best regards, E. Bushuev.
>>

-- 
Best regards, E. Bushuev.

