[Xapian-discuss] Xapian::Queryparser / Encoding Problem (Utf8)

Richard Boulton richard at tartarus.org
Wed Aug 10 15:29:15 BST 2005


On Tue, 2005-08-09 at 15:21 +0200, R. Mattes wrote:
> Well, the subject line says it all - what's the status 
> of the UTF-8 support in the query parser? I recall some
> messages in the list recently but haven't heard of any
> updates. This starts to be a major showstopper for our
> project (all data is in UTF-8 and I'd hate to have to
> rewrite the indexer to recode the data).
> I guess I could have a look at the lemon source but it
> has been a while since I last wrote lemon grammars (and
> never for c++).

I believe that there haven't been any updates since the last flurry of
messages on the list.  (But feel free to check the commit logs for the
relevant module.)

Part of the problem has been that the stemming algorithms did not
support UTF-8.  The upstream algorithms (at
http://snowball.tartarus.org/) now support it quite happily, but
other changes to the output of the stemmers have also occurred since the
algorithms were imported into the Xapian source tree.  Updating them
has therefore been waiting for a major release, since changing the
stemming algorithms will force all databases to be rebuilt with the new
algorithms.  That said, don't let that stop you taking a look at the
work, changing them locally, and submitting a patch.

The query parser itself shouldn't need too much work - you'll probably
need to look at the accent normalising code (see accentnormalisingitor.h
and symboltab.h).
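
As a rough illustration of what UTF-8-aware accent normalising involves,
here is a minimal sketch.  It is not taken from the Xapian source: the
names decode_utf8, append_utf8, fold_latin1_lower and fold_accents are
hypothetical, the mapping covers only a handful of lowercase Latin-1
letters, and the decoder skips overlong and surrogate checks.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 into code points. Simplified: no overlong or surrogate
// checks; each byte of a malformed or truncated sequence yields U+FFFD.
std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        bool ok = true;
        if (c < 0x80)                { cp = c; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else                         { ok = false; }
        if (ok && i + extra >= s.size()) ok = false;  // truncated sequence
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Append one code point to a string as UTF-8.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Illustrative mapping only: a few lowercase Latin-1 letters.
// Returns 0 when the code point has no entry.
char fold_latin1_lower(std::uint32_t cp) {
    if (cp >= 0xE0 && cp <= 0xE5) return 'a';  // U+00E0..U+00E5
    if (cp == 0xE7) return 'c';                // U+00E7
    if (cp >= 0xE8 && cp <= 0xEB) return 'e';  // U+00E8..U+00EB
    if (cp >= 0xEC && cp <= 0xEF) return 'i';  // U+00EC..U+00EF
    if (cp == 0xF1) return 'n';                // U+00F1
    if (cp >= 0xF2 && cp <= 0xF6) return 'o';  // U+00F2..U+00F6
    if (cp >= 0xF9 && cp <= 0xFC) return 'u';  // U+00F9..U+00FC
    return 0;
}

// Replace mapped accented letters with their base letter and pass
// every other code point through unchanged.
std::string fold_accents(const std::string& utf8) {
    std::string out;
    for (std::uint32_t cp : decode_utf8(utf8)) {
        const char base = fold_latin1_lower(cp);
        if (base != 0) out += base;
        else append_utf8(out, cp);
    }
    return out;
}
```

The point of the sketch is that the normaliser has to work on decoded
code points rather than on individual bytes, since an accented letter
occupies two or more bytes in UTF-8.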

Oh, and note that the very latest English stemming algorithm from
snowball makes use of apostrophe characters if it's presented with them,
so it would be good to stop stripping them out of the input to the
stemmer, if the language is English.
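
A minimal sketch of that idea, assuming the stripping happens in a small
helper that runs before the stemmer is called.  The function name
prepare_for_stemmer and the language string "english" are hypothetical
and are not taken from the Xapian source; only the ASCII apostrophe is
handled here, not the typographic U+2019.

```cpp
#include <string>

// Hypothetical pre-stemming filter: drop apostrophes from a word unless
// the language is English, in which case they are passed through so the
// stemmer can make use of them.
std::string prepare_for_stemmer(const std::string& word,
                                const std::string& language) {
    const bool keep_apostrophe = (language == "english");
    std::string out;
    out.reserve(word.size());
    for (char c : word) {
        if (c == '\'' && !keep_apostrophe) continue;  // strip as before
        out += c;
    }
    return out;
}
```

With this helper, "don't" reaches an English stemmer unchanged, while
for any other language the apostrophe is still removed.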

-- 
Richard Boulton <richard at tartarus.org>



