[Xapian-discuss] Search queries with wildcards

Timo Haberkern thaberkern at emedia-office.de
Wed Dec 15 10:21:57 GMT 2004


Hello,

rm at fabula.de wrote:

>On Wed, Dec 15, 2004 at 08:01:56AM +0100, Timo Haberkern wrote:
>  
>
>>A wildcard search would be great. In Germany we have a lot of 
>>compound words, and a purely stemmer-based search misses a lot of matches. 
>>Think of the word "Fehlercode": if I use "Fehler" as a search query, I 
>>won't find the documents containing "Fehlercode", right? But I need 
>>such a solution, and wildcards seem to be the only one.
>>
>>How can the wildcard search be done? Do you have to develop something 
>>for that?
>>    
>>
>
>Ah, so you indeed want to abuse wildcard search for proper indexing ;-)
>  
>
Ahmm, no, I only want the user of the search to be able to 
search for word fragments :-) So I don't care about matches that 
lack the correct semantic context (as you mention below). Maybe 
another example can shed some light on what I want:

There are article numbers in the documents I want to index, for example:

A1590-789
A1590-555
A6719-9911

The first 5 characters are an article-group identifier. The user 
should be able to search for all documents with articles of a given 
article group, for example with the 
search query "A1590*" or "A1590-*".

But I don't want to search only for article numbers; searching for 
fragments should be possible for plain word fragments too (as described 
in my last mail).
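
To make the request concrete, here is a minimal sketch in plain Python of what I mean by a trailing-wildcard search: the pattern is expanded to every indexed term that starts with the given prefix. This is not Xapian API; the names expand_wildcard and TERMS are made up for illustration, and I am assuming the index can hand out its terms in sorted order.

```python
from bisect import bisect_left

def expand_wildcard(sorted_terms, pattern):
    """Expand a pattern like 'A1590*' to all terms sharing that prefix.

    `sorted_terms` must be sorted; terms with a common prefix are then
    contiguous, so we can jump to the first candidate with a binary search.
    A pattern without a trailing '*' is treated as an exact match.
    """
    if not pattern.endswith("*"):
        i = bisect_left(sorted_terms, pattern)
        found = i < len(sorted_terms) and sorted_terms[i] == pattern
        return [pattern] if found else []

    prefix = pattern[:-1]
    matches = []
    for term in sorted_terms[bisect_left(sorted_terms, prefix):]:
        if not term.startswith(prefix):
            break  # left the contiguous run of matching terms
        matches.append(term)
    return matches

# Hypothetical term list built from the examples in this mail.
TERMS = sorted(["A1590-789", "A1590-555", "A6719-9911",
                "Fehler", "Fehlercode", "Code"])

print(expand_wildcard(TERMS, "A1590*"))   # ['A1590-555', 'A1590-789']
print(expand_wildcard(TERMS, "Fehler*"))  # ['Fehler', 'Fehlercode']
```

The expanded terms would then be combined with OR to form the actual query, so the same mechanism covers both article groups and word fragments.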

That's what I want. Is there a way to do this in Xapian?

Timo

>The proper way to do it: have your stemmer do all the hard work.
>If both "Fehler" and "Fehlercode" stem to the same stem there's no real
>problem (as long as this is not the only term in a query, but then, single
>word queries are rather bad for statistical IR ...). Unfortunately this
>does introduce some semantic problems: a "Fehlercode" (error code) isn't
>a "Fehler" (error) but a specific "Code".
>
>Another possibility would be to have the stemmer emit several component 
>terms ("Fehler" "Code") - as tempting as this might first seem (it _does_
>look more correct than the first solution) it bears similar semantic problems
>as the first solution. The "true" stem would be just "Code". 
>
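
As a rough illustration of the component-term idea quoted above, the following plain Python sketch splits a compound using a small word list and emits the whole word plus its parts as index terms. It is only a sketch under assumptions: split_compound, index_terms and LEXICON are hypothetical names, the lexicon is a toy, and German linking elements (such as the "s" between components) are not handled.

```python
def split_compound(word, lexicon, min_len=3):
    """Split `word` into known components; return [word] if no full split exists."""
    word = word.lower()

    def helper(rest):
        if not rest:
            return []  # consumed the whole word: successful split
        # Try the longest known component first, then shorter ones.
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head in lexicon:
                tail = helper(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no way to cover `rest` with lexicon entries

    parts = helper(word)
    return parts if parts else [word]

def index_terms(word, lexicon):
    """Emit the full word first, followed by any distinct components."""
    terms = [word.lower()]
    for part in split_compound(word, lexicon):
        if part not in terms:
            terms.append(part)
    return terms

# Toy lexicon, purely for illustration.
LEXICON = {"fehler", "code"}

print(index_terms("Fehlercode", LEXICON))  # ['fehlercode', 'fehler', 'code']
```

With terms emitted like this, a query for "Fehler" would match documents containing "Fehlercode" without any wildcard, at the cost of the semantic problems described in the quoted text.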
>The Right Thing to do here is to introduce multiple ranked stems. Unfortunately
>there's no free/open source stemmer for your language of choice :-/
>A working stemmer for German needs to do some context analysis, a lot of
>morphological knowledge and a good (!) dictionary. Iff you need this for
>a commercial product I could point you in the right direction (no, I'm not
>affiliated with these sources :-)
>
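
One way to picture "multiple ranked stems" is a list of terms with weights, where the whole compound ranks highest, the head (the last component, "Code" in the example) comes next, and the modifiers come last. The Python sketch below is only my reading of the idea; ranked_terms and the weights 3/2/1 are arbitrary assumptions, not something a real stemmer prescribes.

```python
def ranked_terms(compound, components):
    """Return (term, weight) pairs, highest weight first.

    Illustrative weights: whole compound = 3, head (last component) = 2,
    every other component (modifier) = 1.
    """
    ranked = {compound.lower(): 3}
    last = len(components) - 1
    for i, part in enumerate(components):
        weight = 2 if i == last else 1
        key = part.lower()
        # Keep the highest weight if a term appears more than once.
        ranked[key] = max(ranked.get(key, 0), weight)
    # Sort by descending weight, then alphabetically for a stable order.
    return sorted(ranked.items(), key=lambda kv: (-kv[1], kv[0]))

print(ranked_terms("Fehlercode", ["Fehler", "Code"]))
# [('fehlercode', 3), ('code', 2), ('fehler', 1)]
```

An indexer could use such weights when adding the terms, for instance as a term frequency increment, so that a match on "Code" counts for more than a match on "Fehler".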
> HTH Ralf Mattes
>
>  
>
>>regards
>>
>>Timo
>>
>>    
>>
>>>Cheers,
>>>  Olly


