[Snowball-discuss] Unicode

Michael Schlenker schlenk@uni-oldenburg.de
Fri, 22 Feb 2002 19:03:54 +0100

At 10:35 22.02.2002 -0700, you wrote:

>Unicode is on its way, and I'll outline the proposed solution:
>For input I think I'll alter the syntax of hex strings so that internal
>spaces separate characters. This is not upwards-compatible, but I can't
>imagine anyone will be upset.
>So                     hex '0D0A'
>needs to be written    hex '0D 0A'
>or even                hex 'D A'    // leading zeroes can be omitted
>A new style 'decimal' will be introduced:
>                        decimal '13 10' // cr lf
>Then we allow all values from 0 to 64K-1. values >= 64K produce an error
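
A quick inline note on the proposed syntax: the sketch below is only my reading of it, not actual Snowball code. The names parse_char_list and digit_value are made up for illustration. It assumes a single space separates characters, that leading zeroes are optional, and that any value >= 64K is an error, as described above.

```c
#include <assert.h>

/* Hypothetical helper: value of one digit character, or -1 if it is not
   a hex digit. */
static int digit_value(int ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

/* Hypothetical helper: parse a space-separated list of numbers in the
   given base (16 for hex '...', 10 for decimal '...') into out[].
   Returns the number of characters parsed, or -1 if a digit is invalid
   for the base, a value is >= 64K, or out[] has fewer than the needed
   slots. */
static int parse_char_list(const char *s, int base,
                           unsigned short *out, int max)
{
    int n = 0;
    assert(base == 10 || base == 16);
    while (*s != '\0') {
        unsigned long v = 0;
        if (*s == ' ') { s++; continue; }     /* skip separators */
        while (*s != '\0' && *s != ' ') {
            int d = digit_value((unsigned char)*s);
            if (d < 0 || d >= base) return -1; /* bad digit for this base */
            v = v * (unsigned long)base + (unsigned long)d;
            if (v > 0xFFFFUL) return -1;       /* values >= 64K are errors */
            s++;
        }
        if (n >= max) return -1;               /* output buffer too small */
        out[n++] = (unsigned short)v;
    }
    return n;
}
```

With this reading, "0D 0A" and "D A" in base 16 and "13 10" in base 10 all yield the two characters 13 and 10, while the old-style "0D0A" yields the single character 0x0D0A, which is exactly the incompatibility mentioned above.
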
>For output, the java case is not a problem, since strings are made up of 16
>bit items anyway.
>For ANSI C we'll have 3 output styles:
>1) 8 bit characters, when reference to a character > 255 is an error. This
>is the default style for output in the ANSI C case, and is what we have at
>the moment.
>2) 16 bit characters. The way to get this is to declare all strings in the 
>     static symbol string_37[] = {'f','r','e','d'};
>     static symbol string_38[] = {'h','a','r','r','y'};
>     ...
>and we typedef symbol to 'unsigned short'. When it is typedeffed to
>'unsigned char' we get case (1) again. Of course any of the characters 'f'
>etc may be replaced by a number > 255 to get Unicode characters.
>3) UTF-8 encoded 8 bit characters. I believe the only change to the
>generated C is that cursor movements of the form z->c++; and z->c--; need to
>be replaced by function calls that move over 1,2 or 3 bytes to get to the
>next character.
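
For style (3), the cursor functions could look roughly like the sketch below. This is an assumption-laden illustration, not the generated code: struct env_sketch and its fields p and l, and the names next_char and prev_char, are hypothetical; only the cursor z->c comes from the description above. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, and it assumes the input is well-formed UTF-8.

```c
#include <assert.h>

typedef unsigned char symbol;

/* Hypothetical stand-in for the stemmer state; only what this sketch
   needs. */
struct env_sketch {
    const symbol *p;   /* UTF-8 encoded text (assumed well-formed) */
    int c;             /* cursor, as a byte offset into p */
    int l;             /* limit, i.e. length of p in bytes */
};

/* Replacement for z->c++ : move forward over one character (1, 2 or 3
   bytes for values below 64K). Returns 0 if already at the limit. */
static int next_char(struct env_sketch *z)
{
    assert(z->c >= 0 && z->c <= z->l);
    if (z->c >= z->l) return 0;
    z->c++;                                   /* step over the lead byte */
    while (z->c < z->l && (z->p[z->c] & 0xC0) == 0x80)
        z->c++;                               /* skip continuation bytes */
    return 1;
}

/* Replacement for z->c-- : move backward over one character.
   Returns 0 if already at the start. */
static int prev_char(struct env_sketch *z)
{
    assert(z->c >= 0 && z->c <= z->l);
    if (z->c <= 0) return 0;
    z->c--;                                   /* last byte of the character */
    while (z->c > 0 && (z->p[z->c] & 0xC0) == 0x80)
        z->c--;                               /* back up to the lead byte */
    return 1;
}
```

The cursor stays a byte offset, so the rest of the generated code would not need to know how wide a character is; only the two stepping operations change.
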
>- - - - - -
>Does anyone know of a program of the form
>      convert <input >output -option
>where option could be ISOLatin1_to_Unicode, Unicode_to_Windows etc etc? I'll
>have to put something together like this for test purposes.
It's trivial with Tcl/Tk 8.1 and up (8.3.4 is the current stable version);
those versions are fully Unicode-aware.

Just use:

# There should be some simple option processing here; if anyone's interested,
# it's trivial. For now just assume the first two arguments are the input
# and output encodings.

package require Tcl 8.1                         ;# needs Tcl 8.1 or later
set inputenc  [lindex $argv 0]                  ;# input encoding (argv does not include the script name)
set outputenc [lindex $argv 1]                  ;# output encoding
fconfigure stdin  -encoding $inputenc           ;# configure stdin to use the input encoding
fconfigure stdout -encoding $outputenc          ;# configure stdout to use the output encoding
fcopy stdin stdout                              ;# copy stdin to stdout

Then call it with Tcl's encoding names (Latin-1 is iso8859-1), e.g.:

convert <infile >outfile iso8859-1 utf-8

To get the supported encodings:
$ tclsh83
% encoding names
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 cp949 cp950 cp869 
dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp 
macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 
iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish 
gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 
iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic 
iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 
iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan 
cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857

Should be enough for most cases.

Michael Schlenker

(P.S. If you have any questions, just ask me or post in comp.lang.tcl. You can
get Tcl/Tk from SourceForge http://www.sf.net/projects/tcl or from
ActiveState http://tcl.activestate.com )
