[Xapian-discuss] Trouble with German language indexing/searching
Olly Betts
olly at survex.com
Wed Feb 15 20:01:42 GMT 2006
On Wed, Feb 15, 2006 at 11:51:29AM -0500, Jim Lynch wrote:
> OK, not entirely. When I search for für using Omega, the term that gets
> returned in the resultant xml is
> <queryterm term="fuer" show="fuer" freq="17"/>
>
> I'm using a simple script to generate contextual samples and obviously
> it doesn't work. So where do I go to tell Xapian that I've got an
> extended character set?
Currently the QueryParser performs transliteration of accented
characters (assuming character set iso-8859-1), and this is done
even when stemming is disabled. In this case, "u-umlaut" is converted
to "ue".
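To illustrate the effect (this is only a rough sketch, not Xapian's actual code): the "ü" to "ue" mapping is the case described above, while the other table entries and the names `TRANSLIT`, `transliterate` and `find_matches` are made up for the example. A sample-generating script could apply the same folding to the document text so that the term Omega reports can still be matched against the original words.

```python
# Illustrative mapping only: the "ü" -> "ue" entry is the case described
# above; the other entries are assumptions for this example and are not
# taken from Xapian's source.
TRANSLIT = {"ü": "ue", "ä": "ae", "ö": "oe", "ß": "ss"}


def transliterate(word):
    """Lower-case a word and replace accented characters via TRANSLIT."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word.lower())


def find_matches(text, term):
    """Return the whitespace-separated words whose folded form equals term."""
    return [w for w in text.split() if transliterate(w) == term]
```

With this, looking up the reported term "fuer" in the text "Ein Geschenk für dich" would pick out the original word "für", which is what a contextual-sample script needs to highlight.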
This has been discussed before a few times, for example:
http://thread.gmane.org/gmane.comp.search.xapian.general/1815
I'm planning to revisit this area before 1.0. I suspect that I'll
remove the transliteration from the QueryParser, and any of it that
makes sense to keep will be pushed into the stemmers (since it's a
form of normalisation).
Meanwhile, it's not hard to disable if you're happy to run a patched
version of Xapian (I thought I'd sent such a patch to the list but I
can't find it right now).
Cheers,
Olly