[Xapian-discuss] TermGenerator question for the single quote character

tata 668 tata668 at gmail.com
Wed Apr 8 17:11:05 BST 2009


With the help of a Xapian mailing list user (you know who you are, 
thanks), I came up with this workaround:

Before passing the text to the TermGenerator for indexing, I preprocess 
it this way:

1. I find every pair of words on the left and right side of a ' or a ’ 
(my regex may not be perfect yet):

preg_match_all("|([\s][^\s'’.;:!\?,»\"/\(\)\[\]]+)['’]([^\s'’.;:!\?,»\"/\(\)\[\]]+)|ui", 
$text, $matches);

2. Then I append the words I found to the end of the original text to 
index. Each added word is preceded by a custom word delimiter 
(something like "||DEL||") to ensure that two added words, side by side, 
won't be matched as a phrase.

Example:

"Bozo l'éléphant aime prèsqu'Alice!"
would be changed to this before indexing:
"Bozo l'éléphant aime prèsqu'Alice! ||DEL||  l ||DEL||  prèsqu ||DEL|| 
éléphant ||DEL|| Alice"
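
For readers who want to try the idea outside PHP, the two steps above can 
be sketched in Python as follows. This is only an illustration under 
stated assumptions, not the code used in the workaround: the function name 
expand_elisions is hypothetical, "||DEL||" is the placeholder delimiter 
from the description, the second apostrophe is assumed to be the 
typographic one (U+2019), and, unlike the PHP pattern, a word at the very 
start of the text is also accepted. The appended words are separated by 
single spaces.

```python
import re

# Placeholder delimiter, as in the description above (an assumption:
# any token that cannot occur in real text would do).
DELIMITER = "||DEL||"

# Characters that end a word: whitespace, both apostrophes (ASCII and
# U+2019) and the punctuation listed in the PHP pattern.
_STOP = "\\s'\u2019.;:!?,\u00bb\"/()\\[\\]"

# A word, an apostrophe, then another word. The left word must follow
# whitespace or the start of the text (the latter is an addition here).
PATTERN = re.compile(
    "(?:^|\\s)([^" + _STOP + "]+)['\u2019]([^" + _STOP + "]+)"
)


def expand_elisions(text):
    """Append both halves of each apostrophe-joined pair to the text."""
    matches = PATTERN.findall(text)
    # Same order as the example: all left parts, then all right parts.
    extra = [left for left, _ in matches] + [right for _, right in matches]
    return text + "".join(" " + DELIMITER + " " + word for word in extra)


print(expand_elisions("Bozo l'éléphant aime prèsqu'Alice!"))
# Bozo l'éléphant aime prèsqu'Alice! ||DEL|| l ||DEL|| prèsqu ||DEL|| éléphant ||DEL|| Alice
```

The returned string is what would then be handed to the TermGenerator in 
place of the original text. Note that a pattern like this also splits 
words such as "aujourd'hui", which may or may not be wanted.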


Any tips or ideas to improve this would be welcome!

Julien




tata 668 wrote:
> I tried it and can confirm that setting it to "french" doesn't help. 
> "m'excite" is still indexed as "m'excite" and not as "m" and "excite".
>
> If someone has an idea on how it could be fixed, it would be really 
> appreciated!
>
> Thank you,
>
> Julien
>
>
>
> tata 668 wrote:
>> I found it: 
>> http://xapian.org/docs/apidoc/html/classXapian_1_1TermGenerator.html#f7d43aef10aa6b26ef853a0ae2695f83
>>
>> I'll try to set it to the French stemmer.
>>
>> Thanks
>>
>> Julien
>>
>>
>>
>> Olly Betts wrote:
>>> On Sun, Apr 05, 2009 at 07:18:08PM -0400, tata 668 wrote:
>>>> I use the TermGenerator to index the French text "Cela m'excite" 
>>>> (without the quotes). When I search for "excite" after indexing, 
>>>> I need it to be found. "excite" is a word on its own.
>>>>
>>>> Currently "excite" is not found but "m'excite" is...
>>>
>>> In 1.0.0, we changed to treating apostrophes as part of a word, and
>>> updated to a newer version of Snowball where the English stemmer
>>> deals with them.
>>>
>>> I think the correct way for this to work is for the other stemmers
>>> to also handle apostrophes (at least if their languages use them)
>>> as otherwise the word tokenisation required depends on the stemmer.
>>>
>>>> Is there a setting I'm missing so that the single quote character acts 
>>>> as a word delimiter?
>>>
>>> No, there's no such setting currently.
>>>
>>> Cheers,
>>>     Olly
>>>
>> _______________________________________________
>> Xapian-discuss mailing list
>> Xapian-discuss at lists.xapian.org
>> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>>


