[Snowball-discuss] 8-bit and 16-bit characters support

James Aylett james@tartarus.org
Wed Jun 4 10:21:02 2003


On Wed, Jun 04, 2003 at 02:20:04AM -0600, Martin Porter wrote:

> I said 2 or 3 byte characters because in utf-8, a character value above 127
> packs into either 2 or 3 bytes. Is that not so?

It actually packs into up to six octets. Unicode originally had a
16-bit code point space (UTF-8 for that always fits into at most three
octets). Eventually that wasn't enough, and the code point space was
extended. In UTF-16 the extra code points are reached using 'surrogate
pairs' (which are fairly vile, IMHO :-), but UTF-8 just encodes them
directly using more octets.
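
To make the contrast concrete, here is a minimal sketch in Python of
both routes for a code point above U+FFFF. The function names
(utf16_surrogates, utf8_four_octets) are made up for illustration and
come from no particular library; the sample code point U+1D11E is
just an arbitrary value beyond the 16-bit range.

```python
def utf16_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF is reachable via surrogates")
    offset = cp - 0x10000             # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go in the low surrogate
    return high, low


def utf8_four_octets(cp):
    """Encode a code point in U+10000..U+1FFFFF as four UTF-8 octets."""
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("this sketch only handles the four-octet range")
    return bytes([
        0xF0 | (cp >> 18),            # lead octet 11110xxx
        0x80 | ((cp >> 12) & 0x3F),   # continuation octets 10xxxxxx
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    cp = 0x1D11E  # hypothetical sample value above U+FFFF
    print([hex(u) for u in utf16_surrogates(cp)])
    print(utf8_four_octets(cp).hex())
```

The UTF-8 route is the same bit-packing scheme used for shorter
sequences, just with a longer lead octet, whereas UTF-16 needs the
separate offset-and-split step.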

Actually, Unicode 4.0 has code points U+0000 to U+10FFFF [Unicode
website], but UTF-8 will encode up to U+7FFFFFFF [UTF-8 RFC 2279];
four octets of UTF-8 will suffice for the current code point range of
Unicode.
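
As a rough illustration of those limits, the sketch below maps a code
point to the number of octets it needs under the RFC 2279 definition
of UTF-8. The name utf8_octet_count is an assumption for this example,
not an existing API.

```python
# Largest code point representable in 1, 2, ... 6 octets (RFC 2279).
UTF8_LIMITS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_octet_count(cp):
    """Return how many octets RFC 2279 UTF-8 needs for code point cp."""
    if cp < 0:
        raise ValueError("code points are non-negative")
    for octets, limit in enumerate(UTF8_LIMITS, start=1):
        if cp <= limit:
            return octets
    raise ValueError("beyond U+7FFFFFFF, not encodable in UTF-8")


if __name__ == "__main__":
    for cp in (0x7F, 0x80, 0xFFFF, 0x10FFFF, 0x7FFFFFFF):
        print("U+%04X -> %d octet(s)" % (cp, utf8_octet_count(cp)))
```

Anything in the old 16-bit space lands in the first three rows of the
table, and U+10FFFF lands in the fourth, which is where the "at most
three" and "four octets will suffice" figures come from.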

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org