[Xapian-discuss] Xapian::Queryparser / Encoding Problem (Utf8)

R. Mattes rm at seid-online.de
Wed Aug 10 15:41:41 BST 2005


On Wed, 2005-08-10 at 15:29 +0100, Richard Boulton wrote:

> I believe that there haven't been any updates since the last flurry of
> messages on the list.  (But feel free to check the commit logs for the
> relevant module.)

I was afraid of that - just wanted to make sure.

> Part of the problem has been that the stemming algorithms used not to
> support UTF-8 - however, the upstream algorithms (at
> http://snowball.tartarus.org/) now support this quite happily.  However,
> other changes to the output of the stemmers have also occurred since the
> algorithms were imported into the Xapian source tree, so updating the
> algorithms has been waiting for a major release (since changing the
> stemming algorithms will force all databases to be rebuilt with the new
> algorithms).  That said, don't let that stop you taking a look at the
> work, and changing them locally (and submitting a patch...)

Well, the stemmer is the lesser problem - I'd be happy if at least
unstemmed terms would stay correct (and _not_ be truncated at the first
non-ASCII character :-/ ).
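
To illustrate the truncation, here is a minimal sketch in Python. It is
not Xapian's actual parser code, and both function names are hypothetical.
It assumes the term scanner walks the UTF-8 input byte by byte and accepts
only ASCII alphanumerics. Every byte of a multi-byte UTF-8 sequence has
its high bit set, so such a scanner stops at the first umlaut:

```python
def ascii_only_term(text: str) -> str:
    """Hypothetical byte-wise scanner: stops at the first non-ASCII byte."""
    out = bytearray()
    for b in text.encode("utf-8"):
        # Bytes >= 0x80 belong to multi-byte UTF-8 sequences and are rejected.
        if b < 0x80 and chr(b).isalnum():
            out.append(b)
        else:
            break
    return out.decode("ascii")


def unicode_aware_term(text: str) -> str:
    """Hypothetical code-point scanner: classifies decoded characters."""
    out = []
    for ch in text:
        # str.isalnum() is Unicode-aware, so letters such as 'ü' are kept.
        if ch.isalnum():
            out.append(ch)
        else:
            break
    return "".join(out)


if __name__ == "__main__":
    print(ascii_only_term("Bücher"))
    print(unicode_aware_term("Bücher"))
```

With the input "Bücher" the byte-wise version yields only "B", while the
code-point version keeps the whole word.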

> The query parser itself shouldn't need too much work - you'll probably
> need to look at the accent normalising code (see accentnormalisingitor.h
> and symboltab.h).

Well, looks like this will be my next task on the stack ...

> Oh, and note that the very latest English stemming algorithm from
> snowball makes use of apostrophe characters if it's presented with them,
> so it would be good to stop stripping them out of the input to the
> stemmer, if the language is English.

Unfortunately we are dealing with German data (where stemming is pretty
hard -- well, we would even have access to a great stemmer, but it has
a 500MB+ memory footprint and isn't reentrant ..).

 Thanks for your input 

  RalfD



