[Snowball-discuss] 8-bit and 16-bit characters support
James Aylett
james@tartarus.org
Wed Jun 4 10:21:02 2003
On Wed, Jun 04, 2003 at 02:20:04AM -0600, Martin Porter wrote:
> I said 2 or 3 byte characters because in utf-8, a character value above 127
> packs into either 2 or 3 bytes. Is that not so?
It actually packs into up to six octets. Unicode originally had a
16-bit codepoint space, and UTF-8 for that always fits into at most
three octets. Eventually they ran out, and extended the codepoint
space. In UTF-16 the extra codepoints are reached using 'surrogate
pairs' (which are fairly vile, IMHO :-), but UTF-8 just encodes them
directly using more octets.
To be precise, Unicode 4.0 has codepoints U+0000 to U+10FFFF [Unicode
website], but UTF-8 will encode up to U+7FFFFFFF [UTF-8 RFC 2279];
four octets of UTF-8 will suffice for the current codepoint range of
Unicode.
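
A minimal sketch of those limits in Python, assuming the octet ranges
given in RFC 2279. The function name utf8_octets is hypothetical and
is not part of Snowball; it only counts octets and does not encode
anything.

```python
# Largest codepoint that fits in 1, 2, ... 6 octets under RFC 2279.
# (Illustrative table; the name is an assumption, not a Snowball API.)
RFC2279_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_octets(codepoint):
    """Return how many octets UTF-8 (RFC 2279) needs for a codepoint."""
    if codepoint < 0 or codepoint > RFC2279_LIMITS[-1]:
        raise ValueError("codepoint outside the RFC 2279 range")
    for octets, limit in enumerate(RFC2279_LIMITS, start=1):
        if codepoint <= limit:
            return octets


if __name__ == "__main__":
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        print("U+%04X needs %d octet(s)" % (cp, utf8_octets(cp)))
```

Everything in the original 16-bit space comes out at three octets or
fewer, and U+10FFFF, the top of the Unicode 4.0 range, comes out at
four; the five- and six-octet forms are only reached above U+1FFFFF.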
James
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james@tartarus.org uncertaintydivision.org