[Xapian-discuss] Python bindings and unicode strings
James Aylett
james-xapian at tartarus.org
Tue Sep 4 11:20:50 BST 2007
On Tue, Sep 04, 2007 at 12:50:17AM -0400, Deron Meranda wrote:
> Hmm. Clarity is rather important. I suspect this may just need
> some additional documentation (or maybe it's there? Xapian does
> have a lot of technical documentation, but it's a bit scattered).
It's somewhere, yes. We have plans to sort out the documentation
properly, but time constraints have so far kept people from getting
to it.
> Obviously it makes sense for Stem to work with Unicode, since it
> must deal with written languages. It gets a bit more clouded beyond
> that. Is the core intentionally designed to allow indexing
> arbitrary binary stuff, or is that just a side-effect of it not
> making any assumptions or trying to interpret the bytes in any way?
The core is designed for terms, data etc. to be binary.
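To illustrate what "binary" means here, a minimal sketch in plain Python
(no Xapian calls; the packed value and the helper name is_valid_utf8 are
made up for the example). It shows a byte string that is perfectly good
binary data but cannot be read as UTF-8:

```python
import struct

# Hypothetical payload: a 32-bit integer packed as big-endian bytes.
packed = struct.pack(">I", 0xFFFE0001)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The packed integer starts with 0xFF, which never appears in valid UTF-8.
print(is_valid_utf8(packed))
# Encoded text, by contrast, is always valid UTF-8.
print(is_valid_utf8("café".encode("utf-8")))
```

A store that insisted on UTF-8 would have to reject the first value; one
that treats everything as bytes accepts both.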
> Basically all parts of Xapian, as well as users of it must agree
> whether things are raw bytes or UTF-8 strings. It can't really be
> both, safely anyway.
Then they are raw bytes. If a user wants to use UTF-8, they are free
to do so, but the storage is raw bytes. That isn't going to change,
because UTF-8 is inappropriate in various situations.
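For a user who does want UTF-8, one way to handle it is to convert at the
boundary and hand only bytes to the storage layer. A rough sketch in plain
Python 3, assuming nothing about the bindings themselves (to_term and
from_term are hypothetical helper names, not part of Xapian):

```python
def to_term(text: str) -> bytes:
    """Encode a unicode string as UTF-8 bytes before storing it."""
    return text.encode("utf-8")


def from_term(raw: bytes) -> str:
    """Decode bytes read back from storage, assuming they were UTF-8."""
    return raw.decode("utf-8")


term = to_term("naïve")
# The stored form is bytes; its length is counted in bytes, not characters.
print(len(term))
# Decoding recovers the original string.
print(from_term(term) == "naïve")
```

The choice of encoding then lives in the application, and the storage
layer never needs to know about it.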
J
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james at tartarus.org uncertaintydivision.org