[Xapian-discuss] indexing and queryparsing: UTF-8 and PHP
tata 668
tata668 at gmail.com
Sat Feb 25 21:25:52 GMT 2006
But if the Xapian queryparser doesn't currently support UTF-8, that implies one
of two possibilities:
1) The indexers from the Omega project don't support UTF-8 either
or
2) The Xapian queryparser and the indexers from Omega don't use the same
algorithms to split strings into words!
My problem is still present: I want to be sure the indexed words are
split the same way the words from the query strings will be!
Therefore I guess the best solution for now is to write your own queryparser
and your own indexer, both using the SAME algorithm to split words.
If I take that solution, the only remaining problem is to find a bulletproof
way to split UTF-8 text into words in PHP.
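
To make the "same algorithm on both sides" idea concrete, here is a minimal sketch. It is written in Python rather than PHP purely for illustration, and it is not how Xapian or Omega splits text. The names split_words, index_terms and query_terms are hypothetical, and the rules (any Unicode letter or digit is part of a word, terms are lowercased, input is normalized to NFC) are assumptions chosen for the example.

```python
import re
import unicodedata

# Assumption: a "word" is a run of Unicode letters/digits.
# [^\W_] means "word character, but not underscore".
_WORD_RE = re.compile(r"[^\W_]+")


def split_words(text):
    """Hypothetical shared splitter used by BOTH the indexer and the query side."""
    # Normalize so that "I" + combining diaeresis and the precomposed "Ï"
    # end up as the same code point before splitting.
    text = unicodedata.normalize("NFC", text)
    return [m.group(0).lower() for m in _WORD_RE.finditer(text)]


def index_terms(document_text):
    # Each returned term would then be added to the document,
    # e.g. with Document::add_term() in the Xapian bindings.
    return split_words(document_text)


def query_terms(query_string):
    # The query side calls the very same splitter, so a term in the
    # query can never be split differently from the indexed term.
    return split_words(query_string)
```

With these assumptions, "aaaÏbbb" stays one term on both sides, while a string containing "´" between letters is split in two on both sides. The exact rules matter less than the fact that indexing and querying go through one function.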
----- Original Message -----
From: "Jim Lynch" <jim at fayettedigital.com>
To: <xapian-discuss at lists.xapian.org>
Sent: Saturday, February 25, 2006 1:30 PM
Subject: Re: [Xapian-discuss] indexing and queryparsing: UTF-8 and PHP
> I'm using a combination of scriptindex and omega to index German-language
> texts and the words do not split on accented characters. E.g.
> höchstpersönlichen remains höchstpersönlichen and a search for it finds it
> fine. What does happen is that Xapian transliterates the accented
> characters into digraphs, but since these are unique it doesn't make any
> difference unless you want to use the term list that is returned for
> something.
> Olly posted a patch recently to eliminate that behavior. While omega is a
> CGI program, that does not mean you cannot use it to search a database and
> return results to a program. In fact, that's the way I'm using it myself.
> I use html2text to produce plain text, then read the text in and format it
> the way scriptindex likes it. I then have my search program call
> omega to return an XML file to me with the results. I am using it in CGI
> mode, just because that is convenient, but I could have called it via an
> exec call just as easily.
> Hope that helps.
>
> Jim.
> tata 668 wrote:
>
>> Hi,
>>
>> It's my first message in this mailing list, I hope I'm sending it to the
>> correct address. I'm also new to Xapian and my English is not perfect.
>>
>> I'm testing Xapian from PHP 4.4.1, using the bindings, and it works pretty
>> well. Thanks to everyone involved in this project!
>>
>> My questions:
>>
>> 1) Am I correct when I say that Xapian doesn't provide an indexer
>> function? I mean, from what I understand, the only way to index a text in
>> Xapian is to split it, word by word, *by ourselves*, and then to insert
>> those words, one by one, in Xapian using Document::add_term(). There is
>> no Xapian function that would take a whole text, split the words by
>> itself and index them, right? I have to write my own indexer, my own
>> string splitting function. Is that correct? (And I don't think I want to
>> look at Omega because I do not index webpages, I'm using Xapian to
>> index some custom text inside my application, to provide a fast
>> plain-text search functionality.)
>>
>> 2) My second question is related to the queryparser. I've heard that
>> UTF-8 support is not yet available in release versions. I'm not a C or
>> C++ programmer so I'd prefer not to mess with patches (
>> http://thread.gmane.org/gmane.comp.search.xapian.general/1925 ). But
>> anyway, I don't need full support for my queries so I wrote my own
>> UTF-8 aware queryparser, even if it's not perfect (see question #3).
>>
>> Here's my question: I don't understand how you can use your own parsing
>> method for indexing (see question #1) AND use the provided Xapian
>> queryparser (even if it supported UTF-8)! Am I missing something, or do
>> both sides (the indexing and the queryparsing) have to use the same
>> splitting algorithm if you want the results to be correct? If my indexing
>> algorithm splits "aaaÏbbb" into one word only ("aaaÏbbb") but the Xapian
>> queryparser doesn't consider "Ï" an alphanumeric character and
>> therefore splits the string into two words ("aaa" and "bbb"), my search
>> results won't be correct, right? So I don't see how it is possible to
>> rely on a provided queryparser if there is no indexing function also
>> provided that would use the exact same splitting algorithm.
>>
>> 3) If someone has experience with splitting UTF-8 strings into words
>> using PHP 4, I would be really happy to hear about it. I thought
>> mb_split("\W", $text); would do the job but it seems that it considers
>> some characters alphanumeric (e.g. "´") where, I think, it shouldn't.
>> Any help?
>>
>>
>> Thanks,
>>
>> Jules Landry
>>
>> _______________________________________________
>> Xapian-discuss mailing list
>> Xapian-discuss at lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-discuss