[Xapian-discuss] Trouble with German language indexing/searching

Olly Betts olly at survex.com
Wed Feb 15 20:01:42 GMT 2006


On Wed, Feb 15, 2006 at 11:51:29AM -0500, Jim Lynch wrote:
> OK, not entirely.  When I search for für using Omega, the term that gets 
> returned in the resultant xml is
> <queryterm term="fuer" show="fuer" freq="17"/>
> 
> I'm using a simple script to generate contextual samples and obviously 
> it doesn't work.  So where do I go to tell Xapian that I've got an 
> extended character set?

Currently the QueryParser performs transliteration of accented
characters (assuming the ISO-8859-1 character set), and this is done
even when stemming is disabled.  In this case, "u-umlaut" is converted
to "ue".
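
For a contextual-sample script, one workaround is to apply the same kind
of transliteration to the words of the document text before comparing
them with the term Omega returns.  The following is only a rough Python
sketch of that idea: the TRANSLIT table and the function names are
hypothetical, and the table is an assumption for illustration, not the
actual mapping the QueryParser uses.

```python
import re

# Hypothetical mapping for illustration only; this is NOT the
# QueryParser's real transliteration table.
TRANSLIT = {
    "ä": "ae",
    "ö": "oe",
    "ü": "ue",
    "ß": "ss",
}


def transliterate(word):
    """Lowercase a word and replace each mapped character."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word.lower())


def find_matches(text, query_term):
    """Return the original words whose transliterated form equals query_term."""
    return [w for w in re.findall(r"\w+", text)
            if transliterate(w) == query_term]
```

With this, find_matches(text, "fuer") would pick out occurrences of
"für" in the original text, so the sample can highlight the word as it
was actually written.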

This has been discussed before a few times, for example:

http://thread.gmane.org/gmane.comp.search.xapian.general/1815

I'm planning to revisit this area before 1.0.  I suspect that I'll
remove the transliteration, and any that makes sense to keep will
be pushed into the stemmers (since it's a form of normalisation).

Meanwhile, it's not hard to disable if you're happy to run a patched
version of Xapian (I thought I'd sent such a patch to the list but I
can't find it right now).

Cheers,
    Olly


