[Snowball-discuss] Unicode

Fri, 22 Feb 2002 12:40:51 -0500

What does that mean for the programmer? For me as a Python programmer,
all Unicode strings are internally encoded as UCS-2 or UCS-4. Of course
it is possible to encode the strings to UTF-8. How does the API
change for the different Unicode encodings?

> - - - - - -
>
> Does anyone know of a program of the form
>
>      convert <input >output -option
>
> where option could be ISOLatin1_to_Unicode, Unicode_to_Windows etc etc?
> I'll have to put something together like this for test purposes.

GNU Recode?
Python?

Andreas
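
For test purposes, the core of an ISOLatin1_to_Unicode option is small
enough to sketch directly in C. This is only an illustration, assuming
"Unicode" on the output side means UTF-8; the function name
latin1_to_utf8 is hypothetical and is not part of Snowball or of GNU
Recode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert n ISO Latin-1 bytes to UTF-8.
 * Every Latin-1 byte maps to the Unicode code point of the same value,
 * so bytes below 0x80 are copied and the rest become two UTF-8 bytes.
 * The caller must provide room for 2 * n bytes in out.
 * Returns the number of bytes written. */
static size_t latin1_to_utf8(const unsigned char *in, size_t n,
                             unsigned char *out)
{
    size_t i, j = 0;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        unsigned char ch = in[i];
        if (ch < 0x80) {
            out[j++] = ch;                                   /* ASCII */
        } else {
            out[j++] = (unsigned char)(0xC0 | (ch >> 6));    /* lead byte */
            out[j++] = (unsigned char)(0x80 | (ch & 0x3F));  /* continuation */
        }
    }
    return j;
}
```

A filter of the form "convert <input >output" would read a buffer from
stdin, pass it through a function like this, and write the result to
stdout.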

----- Original Message -----
From: "Martin Porter" <martin_porter@softhome.net>
To: "Snowball discuss" <snowball-discuss@lists.sourceforge.net>
Sent: Friday, February 22, 2002 12:35
Subject: [Snowball-discuss] Unicode

>
> Unicode is on its way, and I'll outline the proposed solution:
>
> For input I think I'll alter the syntax of hex strings so that internal
> spaces separate characters. This is not upwards-compatible, but I can't
> imagine anyone will be upset.
>
> So                     hex '0D0A'
> needs to be written    hex '0D 0A'
> or even                hex 'D A'    // leading zeroes can be omitted
>
> A new style 'decimal' will be introduced:
>
>                        decimal '13 10' // cr lf
>
> Then we allow all values from 0 to 64K-1. Values >= 64K produce an error
> message.
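
The proposed syntax can be sketched as a small parser. This is a
minimal illustration under stated assumptions, not the Snowball
compiler's code: the names parse_symbols and digit_value are
hypothetical, and it assumes that a single space is the only separator
and that anything else is an error.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short symbol;   /* one 16 bit character */

/* Value of digit ch in the given base, or -1 if it is not a digit. */
static int digit_value(char ch, int base)
{
    int d;
    if (ch >= '0' && ch <= '9') d = ch - '0';
    else if (ch >= 'A' && ch <= 'F') d = ch - 'A' + 10;
    else if (ch >= 'a' && ch <= 'f') d = ch - 'a' + 10;
    else return -1;
    return d < base ? d : -1;
}

/* Hypothetical helper: parse the body of hex '...' (base 16) or
 * decimal '...' (base 10) into out[]. Spaces separate characters and
 * leading zeroes may be omitted. Returns the number of characters
 * written, or -1 for a malformed token, a value >= 64K, or more than
 * max characters. */
static int parse_symbols(const char *s, int base, symbol *out, int max)
{
    int n = 0;
    assert(s != NULL && out != NULL && (base == 10 || base == 16));
    for (;;) {
        unsigned long v = 0;
        int d;
        while (*s == ' ') s++;                 /* skip separators */
        if (*s == '\0') return n;              /* end of string */
        if (digit_value(*s, base) < 0) return -1;
        while ((d = digit_value(*s, base)) >= 0) {
            v = v * (unsigned long)base + (unsigned long)d;
            if (v > 0xFFFFUL) return -1;       /* values >= 64K are errors */
            s++;
        }
        if (*s != ' ' && *s != '\0') return -1;
        if (n == max) return -1;
        out[n++] = (symbol)v;
    }
}
```

Under these assumptions '0D 0A', 'D A' and decimal '13 10' all give the
two characters cr lf, while the old form '0D0A' is read as the single
character 0x0D0A, which is why the change is not upwards-compatible.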
>
> For output, the java case is not a problem, since strings are made up of
> 16 bit items anyway.
>
> For ANSI C we'll have 3 output styles:
>
> 1) 8 bit characters, where a reference to a character > 255 is an error.
> This is the default style for output in the ANSI C case, and is what we
> have at the moment.
>
> 2) 16 bit characters. The way to get this is to declare all strings in the
> form
>
>     static symbol string_37[] = {'f','r','e','d'};
>     static symbol string_38[] = {'h','a','r','r','y'};
>     ...
>
> and we typedef symbol to 'unsigned short'. When it is typedeffed to
> 'unsigned char' we get case (1) again. Of course any of the characters 'f'
> etc may be replaced by a number > 255 to get Unicode characters.
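
A compact sketch of style (2), for illustration only. The array
string_39, the macro SYMBOL_COUNT and the function symbols_equal are
hypothetical names, not taken from the generated code. Since the arrays
are declared without a terminating zero, the sketch assumes lengths are
carried separately.

```c
#include <assert.h>
#include <stddef.h>

/* 16 bit characters; typedef to 'unsigned char' instead for style (1). */
typedef unsigned short symbol;

/* Number of characters in a statically declared symbol array. */
#define SYMBOL_COUNT(a) (sizeof(a) / sizeof((a)[0]))

static const symbol string_37[] = {'f','r','e','d'};
/* Hypothetical string with a character > 255: 0x0153 is the oe ligature. */
static const symbol string_39[] = {'c', 0x0153, 'u', 'r'};

/* Compare two symbol strings of explicit length; returns 1 if equal. */
static int symbols_equal(const symbol *a, size_t a_len,
                         const symbol *b, size_t b_len)
{
    size_t i;
    assert(a != NULL && b != NULL);
    if (a_len != b_len) return 0;
    for (i = 0; i < a_len; i++) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}
```

With symbol typedeffed to 'unsigned char', string_39 could no longer
hold 0x0153, which corresponds to the error case described under (1).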
>
> 3) UTF-8 encoded 8 bit characters. I believe the only change to the
> generated C is that cursor movements of the form z->c++; and z->c--; need
> to be replaced by function calls that move over 1, 2 or 3 bytes to get to
> the next character.
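
The cursor functions for style (3) might look like the following
sketch. The names utf8_next and utf8_prev are hypothetical, and the
code assumes the buffer holds valid UTF-8. It relies on the fact that
continuation bytes always have the bit pattern 10xxxxxx, so a character
boundary is any byte that does not match that pattern. Values up to
64K-1 need at most 3 bytes in UTF-8.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char symbol;   /* UTF-8 encoded 8 bit items */

/* Replacement for z->c++: return the offset of the next character
 * after byte offset c, never moving past limit. */
static int utf8_next(const symbol *p, int c, int limit)
{
    assert(p != NULL);
    if (c >= limit) return limit;
    c++;
    while (c < limit && (p[c] & 0xC0) == 0x80) c++;  /* skip continuation bytes */
    return c;
}

/* Replacement for z->c--: return the offset of the character before
 * byte offset c, never moving before the lower bound lb. */
static int utf8_prev(const symbol *p, int c, int lb)
{
    assert(p != NULL);
    if (c <= lb) return lb;
    c--;
    while (c > lb && (p[c] & 0xC0) == 0x80) c--;     /* back up to a lead byte */
    return c;
}
```

In generated code a call such as z->c = utf8_next(z->p, z->c, z->l);
would then stand in for z->c++;. The field names p and l are assumed
here for illustration.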
>
> - - - - - -
>
> Does anyone know of a program of the form
>
>      convert <input >output -option
>
> where option could be ISOLatin1_to_Unicode, Unicode_to_Windows etc etc?
> I'll have to put something together like this for test purposes.
>
> Martin
>
>
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/snowball-discuss
>
