[Xapian-discuss] Python bindings and unicode strings

James Aylett james-xapian at tartarus.org
Tue Sep 4 11:20:50 BST 2007


On Tue, Sep 04, 2007 at 12:50:17AM -0400, Deron Meranda wrote:

> Hmm.  Clarity is rather important.  I suspect this may just need
> some additional documentation (Or maybe it's there? Xapian does
> have a lot of technical documentation, but it's a bit scattered)

It's somewhere, yes. We have plans to sort out the documentation
properly, but so far nobody has had the time to work on it.

> Obviously it makes sense for Stem to work with Unicode, since it
> must deal with written languages.  It gets a bit more clouded beyond
> that.  Is the core intentionally designed to allow indexing
> arbitrary binary stuff, or is that just a side-effect of it not
> making any assumptions or trying to interpret the bytes in any way?

The core is designed for terms, data etc. to be binary.

> Basically all parts of Xapian, as well as users of it must agree
> whether things are raw bytes or UTF-8 strings.  It can't really be
> both, safely anyway.

Then they are raw bytes. If a user wants to use UTF-8, they are free
to do so, but the storage is raw bytes. That isn't going to change,
because UTF-8 is inappropriate in various situations.
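
To make the split concrete, here is a minimal sketch in modern Python 3 of
what "storage is raw bytes, encoding is the caller's choice" looks like.
ByteTermStore, add_term and docs_for are hypothetical names invented for
illustration; this is not the Xapian API, just a dict-backed stand-in that
treats terms as opaque bytes.

```python
# Hypothetical stand-in for a byte-oriented term store; NOT the Xapian API.
class ByteTermStore:
    def __init__(self):
        # Maps each raw-bytes term to the set of document ids containing it.
        self._postings = {}

    def add_term(self, term: bytes, docid: int) -> None:
        # The store never interprets the bytes; it only insists they are bytes.
        if not isinstance(term, bytes):
            raise TypeError("terms are raw bytes; encode text first")
        self._postings.setdefault(term, set()).add(docid)

    def docs_for(self, term: bytes) -> set:
        return set(self._postings.get(term, ()))


store = ByteTermStore()

# Text term: the caller chooses UTF-8 and encodes explicitly.
text_term = "caf\u00e9".encode("utf-8")
store.add_term(text_term, 1)

# Binary term: not valid UTF-8, but the store does not care.
binary_term = b"\xff\xfe\x00\x01"
store.add_term(binary_term, 2)

# Decoding is also the caller's job, and only makes sense for text terms.
decoded = text_term.decode("utf-8")
```

The point of the sketch is only that the encode and decode steps sit on the
caller's side of the boundary, so a term that is not valid UTF-8 is stored
and looked up exactly like any other.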

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


