[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP

Sun Feb 26 00:57:51 GMT 2006

On Sat, Feb 25, 2006 at 11:54:51AM -0500, tata 668 wrote:
> 1) Am I correct when I say that Xapian doesn't provide an indexer function?
> I mean, from what I understand, the only way to index a text in Xapian is
> to split it, word by word, *by ourselves*, and then to insert, one by one,
> those words in Xapian using Document::add_term(). There is no Xapian
> function that would take a whole text, split the words by itself and
> index them, right?

That's right - there's no such function currently, but it's on my list.
As you suggest, it's a bit odd that there's a "parser for queries"
component, but no matching "parser for indexing document text" component.
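
To make the "split it yourself" step concrete, here is a minimal sketch in
Python of the word-splitting part only. The function name split_terms and
the rule it uses (a term is a lowercased run of Unicode letters and digits)
are assumptions for illustration, not what Xapian or Omega do. Each
(term, position) pair it yields is what you would then pass to
Document::add_term(), or to Document::add_posting() if you want positional
information for phrase searches.

```python
import re

# Assumption for illustration: a "word" is a run of Unicode letters/digits
# (underscore excluded).
_WORD_RE = re.compile(r"[^\W_]+")


def split_terms(text):
    """Yield (term, position) pairs, with positions starting at 1."""
    for pos, match in enumerate(_WORD_RE.finditer(text), start=1):
        yield match.group(0).lower(), pos


if __name__ == "__main__":
    for term, pos in split_terms("Fast plain-text search"):
        print(pos, term)
```

Whatever rule you pick here, the query side has to split text the same
way, which is the point of the second question.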

> (And I don't think I want to look at
> Omega because I do not index webpages, I'm using Xapian to index some
> custom text inside my application, to provide a fast plain-text search
> functionality.)

Omega's omindex indexer assumes you're indexing webpages (or documents
in a web server tree).  However Omega's scriptindex indexer is a good
fit for what you want to do - it takes a "dump file" (or several) which
is essentially groups of NAME=VALUE pairs, and another file which
describes what to do for each NAME.  One possible action is to split
VALUE into terms.  Note, though, that the word splitting currently
assumes iso-8859-1 input.
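
As a rough illustration of the "dump file" idea, the sketch below writes
records as groups of NAME=VALUE lines from Python. The helper name
write_dump, the field names, the blank line used to end each record, and
the flattening of embedded newlines to spaces are all assumptions made for
this sketch; check the scriptindex documentation for the exact format.

```python
import io


def write_dump(records, out):
    """Write each record (a dict of NAME -> VALUE) as NAME=VALUE lines.

    Assumption: a blank line ends each record.
    """
    for record in records:
        for name, value in record.items():
            # Keep each value on a single line so that a line of the value
            # cannot be mistaken for the start of another field.
            flat = " ".join(str(value).split())
            out.write(f"{name}={flat}\n")
        out.write("\n")


if __name__ == "__main__":
    buf = io.StringIO()
    write_dump([{"id": 1, "text": "first note\nsecond line"}], buf)
    print(buf.getvalue(), end="")
```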

> 2) My second question is related to the queryparser. I've heard that UTF-8 
> support is not yet available in release versions. I'm not a C or C++ 
> programmer so I'd prefer not to mess with patches (
> http://thread.gmane.org/gmane.comp.search.xapian.general/1925 ).

Patching is really easy:

patch -p0 < PATCHFILE

And then configure and build as usual.

> Here's my question: I don't understand how you can use your own parsing
> method for indexing (see question #1) AND use the provided Xapian
> queryparser (even if it would support UTF-8)! Am I missing something, or
> do both sides (the indexing and the queryparsing) have to use the same
> splitting algorithm if you want the results to be correct?

Both indexing and query parsing do indeed need to have compatible
strategies for identifying terms.  

So currently, to use the utf-8 QueryParser, you have to implement a
compatible tokeniser for indexing yourself.
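
To show why the two sides have to agree, here is a toy sketch in Python
which uses an in-memory term index rather than Xapian. All the names in it
(tokenise, build_index, search) are hypothetical. When the same splitting
rule is used for indexing and for the query, the document is found; when
the query side uses a different rule (here plain whitespace splitting with
no lowercasing), the generated terms don't match what was indexed.

```python
import re
from collections import defaultdict


def tokenise(text):
    """Shared term-splitting rule (an assumption for this sketch)."""
    return [w.lower() for w in re.findall(r"[^\W_]+", text)]


def build_index(docs, split):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in split(text):
            index[term].add(doc_id)
    return index


def search(index, query, split):
    """Return the ids of documents containing every term of the query."""
    terms = split(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))


if __name__ == "__main__":
    docs = {1: "R\u00e9sum\u00e9 tips", 2: "Plain text search"}
    index = build_index(docs, tokenise)
    # Same rule on both sides: the query term matches the indexed term.
    print(search(index, "r\u00e9sum\u00e9", tokenise))
    # Mismatched rule: the query term keeps its capital letter, so no match.
    print(search(index, "R\u00e9sum\u00e9", str.split))
```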

Perhaps I should explain that there's an interrelated collection of
things I'm planning for what will probably be numbered 1.0.  I've
mentioned before that I'm going to work on most of these, but I haven't
actually put all the pieces together like this until now.

Most of these will result in databases built by pre-1.0 not being
reliably searchable by post-1.0, and vice versa, which is why I want to
do them all together at a major version change.  You've touched on a
number of them:

* update to the latest snowball stemmers (which support utf-8).

* clean up and apply the utf-8 patch for QueryParser.

* allow more control over what QueryParser treats as a word character
  (and tweak the defaults to avoid generating phrase searches in cases
  where we don't need to - for example, 2.4.1 is currently a 3 term
  phrase query, and a slow case).

* remove the "accent normalisation" code - for any languages where it
  is desirable, it would be better done by incorporating it into the
  stemming algorithm if it isn't done there already.  The reason why
  it's done separately is historical (Xapian's proprietary precursor
  expected accents to be represented in its own special way, and it was
  easier to normalise them than to translate them!)

* fix the routines from indextext.cc used by omindex/scriptindex to
  handle utf-8 text (hmm, I've already done this for the gmane indexer!)

* add word character configurability to the indextext.cc routines to
  match that of the QueryParser and make available in the core library.

* fix the $highlight command in Omega to handle utf-8 and the
  configurable definitions of what a word is.

* fix omindex to use utf-8 (and convert input from documents in other
  character sets).
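
The word-character items in the list above can be illustrated with a small
sketch in Python. The function split_words and its infix parameter are
hypothetical and are not the planned QueryParser API; treating a character
such as "." as part of a word only when it has a letter or digit on both
sides is just one possible rule.

```python
def split_words(text, infix=""):
    """Split text into words.

    A character in `infix` counts as part of a word only when it has a
    letter or digit on both sides (a hypothetical rule for illustration).
    """
    words, current = [], []
    for i, ch in enumerate(text):
        if ch.isalnum():
            current.append(ch)
        elif (ch in infix and current
              and i + 1 < len(text) and text[i + 1].isalnum()):
            # Infix character between two word characters: keep it.
            current.append(ch)
        else:
            # Any other character ends the current word, if there is one.
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    print(split_words("version 2.4.1."))
    print(split_words("version 2.4.1.", infix="."))
```

With no infix characters, 2.4.1 comes out as three separate words, which
is the case that has to be searched as a phrase; with "." allowed as an
infix it stays a single word, and the trailing full stop is still dropped.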

Before you ask, I don't have a date for 1.0 yet.  I suspect we'll want
at least one more 0.9.X first, to collect up any bug fixes, especially
since upgrading to 1.0 will be a bigger deal than usual, because it will
require a reindex for many users.

Cheers,
    Olly