[Xapian-discuss] utf8 support

Olly Betts olly at survex.com
Sun Apr 10 14:03:32 BST 2005


On Sun, Apr 10, 2005 at 12:45:28PM -0400, info at bannershift.com wrote:
> I would like to know if xapian supports utf8.
> 
> Is it possible to add document data in utf8 format?
> 
> For example
> 
> document.set_data(utf8_description);

The document data is just an opaque blob as far as the library is
concerned.  So you can put whatever you like in there.

However, omega (and omindex and scriptindex) impose a certain structure
on the document data - they use it to store a list of NAME=VALUE pairs,
one per line.
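
As a rough illustration of that layout, here is a minimal sketch that
builds and parses NAME=VALUE lines holding UTF-8 bytes.  The helper
names make_record and parse_record and the field names are made up for
this example and are not part of Xapian or omega.  The resulting string
is the sort of thing you would hand to set_data().

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not a Xapian API): serialise fields as
// NAME=VALUE lines, one per line.  Values are copied byte for byte,
// so UTF-8 passes through unchanged.
std::string make_record(const std::map<std::string, std::string>& fields) {
    std::string out;
    for (const auto& kv : fields) {
        // A name containing '=' or a newline would break the format,
        // as would a value containing a newline.
        assert(kv.first.find_first_of("=\n") == std::string::npos);
        assert(kv.second.find('\n') == std::string::npos);
        out += kv.first;
        out += '=';
        out += kv.second;
        out += '\n';
    }
    return out;
}

// Hypothetical helper: split the blob back into fields.  Only the
// first '=' on a line separates name from value; lines without '='
// are skipped.
std::map<std::string, std::string> parse_record(const std::string& data) {
    std::map<std::string, std::string> fields;
    std::istringstream in(data);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;
        fields[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return fields;
}
```

Since the library never interprets the data, the only constraint here
comes from the line-based format itself: a value can't contain a raw
newline.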

Two parts of the core library currently make character set assumptions:
the stemmers and the query parser.  Both assume latin1.  The assumption
isn't very deeply embedded, and it's something I plan to fix:

http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=30

It'll require tweaking Snowball to produce utf-8 stemmers - there was
some discussion on the Snowball list about this a few months ago:

http://thread.gmane.org/gmane.comp.search.snowball/668
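
To show why a latin1 assumption matters for UTF-8 text, here is a small
sketch that converts latin1 bytes to UTF-8.  The function name
latin1_to_utf8 is made up for this example and is not part of Xapian or
Snowball.  The point is that every latin1 character above 0x7f becomes
two bytes in UTF-8, so code that treats each byte as one latin1
character sees something different from what was meant.

```cpp
#include <string>

// Hypothetical helper (not a Xapian or Snowball API): convert
// ISO-8859-1 (latin1) bytes to UTF-8.  Each latin1 byte is the code
// point of the same value, so bytes below 0x80 are copied unchanged
// and the rest are encoded as a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation
        }
    }
    return out;
}
```

For example, the four latin1 bytes of "caf\xe9" come out as the five
bytes "caf\xc3\xa9".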

Cheers,
    Olly


