[Xapian-discuss] German Danish Russian

win 32 win32ster at gmail.com
Tue Sep 1 16:58:47 BST 2009


Hello, I searched the mailing list, but all the language problems seemed to
disappear with UTF-8 support in Xapian 1.0. Still, I can't figure this out. I'm
on Windows XP, using C# and .NET Framework 3.5.
TermGenerator.IndexText treats some characters as separators, for
example German 'ß' or Danish 'æ', so it splits words containing them into
separate parts, while other letters are 'simplified': ö / ø -> o, ä -> a.
Indexing text in Russian results in an unreadable term list (retrieved later
by iterating through Document.TermListBegin .. End).
I wrote my own indexer that doesn't split those words but adds them with the
AddPosting method. Still, when they are read back from the database there are
'?' (question marks) in place of 'ß' / 'æ' and 'o' in place of ö / ø.
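
To illustrate how the same letters differ between a single-byte encoding and
UTF-8, here is a small sketch. It assumes the text is ISO-8859-1, which is only
an assumption: which code page the binding actually uses is not known here.
The helper latin1_to_utf8 is hypothetical and not a Xapian function.

```cpp
#include <string>

// Hypothetical helper (not part of Xapian): converts ISO-8859-1 bytes to
// UTF-8. Bytes below 0x80 are copied unchanged; every other byte becomes a
// two-byte UTF-8 sequence, because ISO-8859-1 maps directly onto U+0080..U+00FF.
std::string latin1_to_utf8(const std::string &in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
        }
    }
    return out;
}
```

With this, 'ß' (0xDF in ISO-8859-1) becomes the two bytes 0xC3 0x9F, and 'æ'
(0xE6) becomes 0xC3 0xA6.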
The other part of the problem is the QueryParser, which does the same
bad things to query terms. I searched the Xapian source code and found that it
requires a UTF-8 encoded string as input. Am I right?

Query QueryParser::Internal::parse_query(const string &qs, unsigned flags,
                                         string &default_prefix) {
...
    Utf8Iterator it(qs), end;
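
As a way to double-check the bytes that reach parse_query, here is a minimal
sketch of a well-formedness test for UTF-8. The name is_valid_utf8 is
hypothetical and not part of Xapian. It only inspects the byte structure and
says nothing about how the binding marshals strings.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Xapian): returns true if `s` is
// well-formed UTF-8. Rejects stray continuation bytes, truncated sequences,
// overlong forms, surrogates and code points above U+10FFFF.
bool is_valid_utf8(const std::string &s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;             // continuation or invalid lead byte
        if (i + len > n) return false; // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (len == 2 && cp < 0x80) return false;       // overlong
        if (len == 3 && cp < 0x800) return false;      // overlong
        if (len == 4 && cp < 0x10000) return false;    // overlong
        if (cp > 0x10FFFF) return false;               // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        i += len;
    }
    return true;
}
```

For instance, the single byte 0xDF ('ß' in ISO-8859-1) fails this check, while
the two bytes 0xC3 0x9F ('ß' in UTF-8) pass.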

So I made a small patch to the binding's Query.cs / QueryParser.cs files that
lets me override the ParseQuery method so that I pass a UTF-8 encoded string
to the Xapian DLL. With a debugger I break at the DLL entry point
_CSharp_QueryParser_ParseQuery__SWIG_1 and confirm that the passed
parameter is a UTF-8 encoded string. Still, it doesn't help! The query term
iterator returns strange things instead of German/Danish/Russian characters,
and the search fails. Has anyone managed to get search working in non-English
languages?

