[Snowball-discuss] Regarding a unicode version of Snowball

Martin Porter martin_porter@softhome.net
Wed, 28 Nov 2001 01:18:40 -0700


Archie,

I should point out that we have something rather less than a "development
team". We have me, and Richard Boulton who mainly helps with the Web site :-)

Yes, it says in the manual that "at some point Unicode characters will have
to be supported". I have given this some thought since receiving your email,
but before going further would like to ask you: which do you think is a more
convenient representation (not just for you but for Unicode users
generally)? (a) Two bytes per character, so that 'char *' is replaced by
'short *', and you are still handling an array of characters, although the
size of the elements in the array has changed, or (b) a UTF-8 encoded form,
where characters below 128 are held in 1 byte, and other characters are held
in a variable number of bytes? 
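
To make option (b) concrete, here is a minimal sketch in C of how a single
character would be packed into UTF-8. The name utf8_encode_sketch is
hypothetical and not part of Snowball, and the sketch assumes characters
below 0x10000 only (the range option (a) could hold in two bytes), with no
special handling of surrogates.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Snowball.
 * Encodes one character (assumed < 0x10000) as UTF-8 into out,
 * which must have room for 3 bytes. Returns the number of bytes written. */
static size_t utf8_encode_sketch(unsigned int ch, unsigned char *out)
{
    if (ch < 0x80) {
        /* characters below 128 are held in 1 byte */
        out[0] = (unsigned char)ch;
        return 1;
    }
    if (ch < 0x800) {
        /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (ch >> 6));
        out[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    }
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (ch >> 12));
    out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return 3;
}
```

Under this scheme 'a' stays as the single byte 0x61, while e-acute (0xE9)
becomes the two bytes 0xC3 0xA9, so the byte length of a word is no longer
its character count.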

In the case of (a), which byte order should be used? I assume the more
significant byte comes first, so "ab" would become "\0" "a" "\0" "b".
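
As a small illustration of that assumed byte order, the following C sketch
writes an array of two-byte characters out as bytes, more significant byte
first. The name put_be16_sketch is hypothetical, and 'unsigned short' is
assumed to be at least 16 bits wide, as C guarantees.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Snowball.
 * Writes n two-byte characters from s into out (which must hold 2*n bytes),
 * placing the more significant byte of each character first. */
static void put_be16_sketch(const unsigned short *s, size_t n,
                            unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)((s[i] >> 8) & 0xFF); /* high byte */
        out[2 * i + 1] = (unsigned char)(s[i] & 0xFF);        /* low byte  */
    }
}
```

Applied to the two characters 'a' and 'b', this yields the four bytes
0x00 0x61 0x00 0x62, matching the layout described above.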

Martin


