[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Sun Feb 26 02:35:17 GMT 2006

Thanks for all those answers, and thanks for Xapian: my application now
depends on it! ;-)

And, again, if someone knows a robust way to extract the words from a UTF-8 
string in PHP 4, I would be really grateful! I currently use (with the 
mbstring extension):

$words = mb_split("\W", $text) ;
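
One possible alternative to the mb_split() call above, as a minimal sketch
only: it assumes PCRE was built with UTF-8 and Unicode property support (the
/u modifier and \p{...} classes, which I believe need PHP 4.4.0 or later) and
that mbstring is loaded. utf8_words() is a hypothetical helper name, not part
of PHP or Xapian. Since indexing and query parsing need compatible strategies
for identifying terms, the same function would have to be applied to document
text and to query strings alike.

```php
<?php
// Hypothetical helper (not part of PHP or Xapian): split a UTF-8 string
// into lowercase word tokens.
// Assumptions: PCRE has UTF-8 and Unicode property support, and the
// mbstring extension is loaded.
function utf8_words($text)
{
    // Split on every run of characters that are not letters (\p{L}),
    // combining marks (\p{M}) or digits (\p{N}). The /u modifier makes
    // PCRE treat both pattern and subject as UTF-8.
    // PREG_SPLIT_NO_EMPTY drops the empty strings that leading, trailing
    // or adjacent separators would otherwise produce.
    $words = preg_split('/[^\p{L}\p{M}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        // Treat a failed split (for example on invalid UTF-8) as "no words".
        return array();
    }

    // Lowercase each token so indexed terms and query terms compare equal.
    $result = array();
    foreach ($words as $word) {
        $result[] = mb_strtolower($word, 'UTF-8');
    }
    return $result;
}

// Usage: punctuation, hyphens and dots all act as separators.
$words = utf8_words("Élan vital, déjà-vu 2.4.1");
// $words is array("élan", "vital", "déjà", "vu", "2", "4", "1")
```

Note that this treats "2.4.1" as three separate words, and that lowercasing
is my own choice here; whether either matches what the QueryParser does with
the same input would need checking.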

Jules

----- Original Message ----- 
From: "Olly Betts" <olly at survex.com>
To: "tata 668" <tata668 at gmail.com>
Cc: <xapian-discuss at lists.xapian.org>
Sent: Saturday, February 25, 2006 7:57 PM
Subject: Re: [Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

> On Sat, Feb 25, 2006 at 11:54:51AM -0500, tata 668 wrote:
>> 1) Am I correct when I say that Xapian doesn't provide an indexer
>> function?
>> I mean, from what I understand, the only way to index a text in Xapian is
>> to split it, word by word, *by ourselves*, and then to insert those words,
>> one by one, in Xapian using Document::add_term(). There is no Xapian
>> function that would take a whole text, split the words by itself and
>> index them, right?
>
> Not currently, but it's on my list.  As you suggest, it's a bit odd that
> there's a "parser for queries" component, but no matching "parser for
> indexing document text" component.
>
>> (And I don't think I want to look at
>> Omega because I do not index webpages; I'm using Xapian to index some
>> custom text inside my application, to provide fast plain-text search
>> functionality.)
>
> Omega's omindex indexer assumes you're indexing webpages (or documents
> in a web server tree).  However Omega's scriptindex indexer is a good
> fit for what you want to do - it takes a "dump file" (or several) which
> is essentially groups of NAME=VALUE pairs, and another file which
> describes what to do for each NAME.  One possible action is to split
> VALUE into terms.  Currently iso-8859-1 input is assumed by the word
> splitting though.
>
>> 2) My second question is related to the queryparser. I've heard that
>> UTF-8 support is not yet available in release versions. I'm not a C or
>> C++ programmer so I'd prefer not to mess with patches (
>> http://thread.gmane.org/gmane.comp.search.xapian.general/1925 ).
>
> Patching is really easy:
>
> patch -p0 < PATCHFILE
>
> And then configure and build as usual.
>
>> Here's my question: I don't understand how you can use your own parsing
>> method for indexing (see question #1) AND use the provided Xapian
>> queryparser (even if it supported UTF-8)! Am I missing something, or
>> do both sides (the indexing and the queryparsing) have to use the same
>> splitting algorithm if you want the results to be correct?
>
> Both indexing and query parsing do indeed need to have compatible
> strategies for identifying terms.
>
> And currently to use the utf-8 QueryParser you have to implement a
> compatible tokeniser for indexing.
>
> Perhaps I should explain that there's an interrelated collection of
> things I'm planning for what will probably be numbered 1.0.  I've
> mentioned that I'm going to work on most of these before, but not
> actually put all the pieces together like this before.
>
> Most of these will result in databases built by pre-1.0 not being
> reliably searchable by post-1.0, and vice versa, which is why I want to
> do them all together at a major version change.  You've touched on a
> number of them:
>
> * update to the latest snowball stemmers (which support utf-8).
>
> * clean up and apply the utf-8 patch for QueryParser.
>
> * allow more control over what QueryParser treats as a word character
>  (and tweak the defaults to avoid generating phrase searches in cases
>  where we don't need to - for example: 2.4.1 is currently a 3 term
>  phrase query, and a slow case).
>
> * remove the "accent normalisation" code - for any languages where it
>  is desirable, it would be better done by incorporating it into the
>  stemming algorithm if it isn't done there already.  The reason why
>  it's done separately is historical (Xapian's proprietary precursor
>  expected accents to be represented in its own special way, and it was
>  easier to normalise them than to translate them!)
>
> * fix the routines from indextext.cc used by omindex/scriptindex to
>  handle utf-8 text (hmm, I've already done this for the gmane indexer!)
>
> * add word character configurability to the indextext.cc routines to
>  match that of the QueryParser and make available in the core library.
>
> * fix the $highlight command in Omega to handle utf-8 and the
>  configurable definitions of what a word is.
>
> * fix omindex to use utf-8 (and convert input from documents in other
>  character sets).
>
> Before you ask, I don't have a date for 1.0 yet.  I suspect we'll want
> at least one more 0.9.X first, to collect up any bug fixes, especially
> since upgrading to 1.0 will be a bigger deal than usual, because it will
> require a reindex for many users.
>
> Cheers,
>    Olly