[Xapian-discuss] UTF-8 becomes gibberish in searches

Olly Betts olly at survex.com
Thu Oct 18 22:32:02 BST 2007


On Thu, Oct 18, 2007 at 12:47:21PM -0700, athlon athlonf wrote:
> I'm using dbi2omega and scriptindex to index a database with Chinese
>  characters.
> Searches are done with php4-bindings.
> 
> While the index file is in UTF-8, the results from the searches are
>  gibberish.
> 
> These characters (changed to htmlencoding for this message)
> ?????? becomes something like this: å??äº???ä¸

I just see "?" and inverse "?" here in mutt I'm afraid...

> What am I doing wrong here? Is it the indexing, or is it the searching?

You need to step through the process, checking that everything is OK
after each step.  The fault could be in dbi2omega, in scriptindex, in
Xapian itself, or in the PHP bindings.

First of all, I'd run dbi2omega redirected to a file, and then see if
the UTF-8 is correct in that file.
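
One way to check such a file is to try decoding it strictly as UTF-8 and
report where decoding first fails.  This is only a sketch: the function
names and the "dump.txt" file name are made up for illustration and are
not part of dbi2omega or Xapian.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        # Strict decoding raises on the first malformed byte sequence.
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def check_file(path: str) -> Optional[int]:
    """Read a file as raw bytes and check whether it is valid UTF-8."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())


# Hypothetical usage, assuming the dbi2omega output was saved as dump.txt:
#     offset = check_file("dump.txt")
#     print("valid UTF-8" if offset is None else f"bad byte at offset {offset}")
```

Note that a file can be valid UTF-8 and still hold the wrong characters
(for example if it was encoded twice), so it is worth also viewing it in
a UTF-8 aware editor.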

>  How can I check if the database is indeed in utf-8?

Use the "delve" utility (in xapian-core, examples/delve) to look at the
terms for a few documents.

If both dbi2omega and the database look OK, then it's probably the PHP
bindings.  If you're writing the results as a web page, have you set
the character set of the webpage to UTF-8 correctly?  Check what your
web browser says its character set is.
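
Output of the "å", "ä" kind is what correct UTF-8 bytes typically look
like when they are displayed as a single-byte character set.  The sketch
below reproduces that effect, assuming ISO-8859-1 (Latin-1) is the wrong
character set in play; the function names are hypothetical, and a real
browser may use a slightly different fallback.

```python
def misread_as_latin1(text: str) -> str:
    """Show how UTF-8 encoded text looks when displayed as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_latin1_misread(garbled: str) -> str:
    """Reverse the misreading, recovering the original text."""
    return garbled.encode("latin-1").decode("utf-8")


sample = "\u4e2d\u6587"  # two Chinese characters, six bytes in UTF-8
garbled = misread_as_latin1(sample)

# Each three-byte character turns into three unrelated Latin-1 characters.
print(len(sample), len(garbled))
print(undo_latin1_misread(garbled) == sample)
```

If the search results follow this pattern, the stored data is probably
fine and only the declared character set of the page needs fixing.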

Cheers,
    Olly


