[Xapian-discuss] Search queries with wildcards

rm at fabula.de rm at fabula.de
Wed Dec 15 10:53:08 GMT 2004


On Wed, Dec 15, 2004 at 11:21:57AM +0100, Timo Haberkern wrote:
> >
> >Ah, so you indeed want to abuse wildcard search for proper indexing ;-)
> > 
> >
> Ahmm, no, I only want to give the user of the search the possibility to
> search for word fragments :-)

Hmm, but I think there's one important thing to point out here (please, IR
experts/gurus, correct me if I'm wrong): statistical IR is all about TERMs,
and a term is, in most cases, an abstraction of a "word". So what the engine
stores is information about the presence or absence of a TERM in a document
(and statistics such as the number of occurrences). A fragment of a word is
not a term, so the index has no entry for it.
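
To make that concrete, a toy sketch in Python (plain dictionaries and made-up
documents; this is only the idea, not how Xapian actually lays things out):

```python
from collections import Counter, defaultdict

# Toy documents; the ids and texts are made up for illustration.
docs = {
    1: "part of the parent process",
    2: "the partner process",
}

# term -> {doc id -> number of occurrences in that document}
postings = defaultdict(Counter)
for doc_id, text in docs.items():
    for term in text.lower().split():
        postings[term][doc_id] += 1

# Presence/absence is a lookup of the whole term, nothing else.
print(sorted(postings.get("process", {})))  # [1, 2]
print(sorted(postings.get("par", {})))      # []  ("par" was never indexed)
```

The lookup for "par" comes back empty even though three indexed words start
with it, which is why fragment search does not fall out of the index for free.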

> So I don't care about matches that
> don't have the correct semantic context (as you mentioned below). Maybe
> another example can shed some light on what I want:
>
> There are article numbers in the documents I want to index. For example:
> 
> A1590-789
> A1590-555
> A6719-9911
> 
> Where the first 5 characters are an article-group identifier. The user
> should be able to search for all documents with articles of an
> article group. Therefore he should be able to use, for example, the
> search query: "A1590*" or "A1590-*"

That's what I feared :-) Such a technique is _often_ used in the database
world to "fake" a semantic query - and a lot of customers/users are so
used to "prefix" searching that they expect all IR systems to work like
this. IMHO there's a major design flaw here: A1590-789 encapsulates
_two_ TERMS (article group is 'A1590' AND article ID is '789'). Your stemmer
(or rather the tokenizing step of your indexer; yes, it looks like you need
to write your own) needs to split the string "A1590-789" into its components
and stuff those into different fields ("GR:1590" and "ID:789"). Then your
users can query by field and get the full power of statistical ranking for
text searches. (Sidenote: your example is a good candidate for boolean
queries, i.e. queries where the answer is either "document does contain
term" or "document doesn't contain term". Xapian handles these as well, but
it can do more: ranking etc.)
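
A rough sketch of such a splitting step, in Python. The pattern (one letter,
digits, dash, digits) is only guessed from the three examples in this thread,
and article_terms() is a made-up helper name, not a Xapian call:

```python
import re

# Assumed layout, taken from the examples in this thread:
# one letter, the group digits, a dash, the article digits.
ARTICLE_RE = re.compile(r"^([A-Z])(\d+)-(\d+)$")

def article_terms(token):
    """Return prefixed terms for an article number, or [] if it is not one."""
    m = ARTICLE_RE.match(token)
    if m is None:
        return []
    _letter, group, ident = m.groups()
    # "GR:" and "ID:" are the field prefixes suggested above; the leading
    # letter is dropped as in that example (keep it if it is significant).
    return ["GR:" + group, "ID:" + ident]

print(article_terms("A1590-789"))   # ['GR:1590', 'ID:789']
print(article_terms("A6719-9911"))  # ['GR:6719', 'ID:9911']
print(article_terms("parent"))      # []
```

With terms like these in the index, "all articles of group A1590" becomes an
exact match on the single term "GR:1590" and needs no wildcard at all.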


> 
> But: I don't want to search only for article numbers; searching for
> fragments should be possible for simple word fragments too (as described
> in my last mail).
>
> That's what I want. Is there a way to do this in Xapian?

Hard to do. That kind of query is hard to handle for all sorts of
indexing (even in an RDBMS, where such stuff is handled better). As long
as you can restrict your queries to right _or_ left truncation (i.e.
either "Blah*" _or_ "*Blah") one can use a (b)tree/suffix-array etc. index
scheme, but if you need "*blah*" or "bl*h" you pretty much end up with a
full scan of all terms. No, Xapian isn't of much help here.
One thing you _could_ do (but it's b*t-ugly): parse your query and
expand all wildcards into alternative terms from the index.
So a query for "par*" would trigger a scan over the index that expands the
query into "parent OR part OR partner OR ...".
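
A minimal sketch of that expansion in Python. The term list is hypothetical
(in practice it would be read from the index), the function names are made
up, and the result is just a query string to hand to whatever parser you use:

```python
import bisect
from fnmatch import fnmatchcase

# Hypothetical term list; a real one would be read from the index.
terms = sorted(["apart", "parent", "part", "partner", "past", "spare"])

def expand_prefix(prefix):
    """Right truncation ("par*"): binary search into the sorted term list."""
    start = bisect.bisect_left(terms, prefix)
    found = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        found.append(term)
    return found

def expand_scan(pattern):
    """Anything else ("*par*", "p*t"): test every term, i.e. a full scan."""
    return [t for t in terms if fnmatchcase(t, pattern)]

def expand(pattern):
    """Rewrite one wildcard pattern as an OR of indexed terms."""
    head = pattern[:-1]
    if pattern.endswith("*") and not any(c in "*?[" for c in head):
        matches = expand_prefix(head)
    else:
        matches = expand_scan(pattern)
    return " OR ".join(matches)

print(expand("par*"))   # parent OR part OR partner
print(expand("*par*"))  # apart OR parent OR part OR partner OR spare
print(expand("p*t"))    # parent OR part OR past
```

Only the "par*" case gets away with a binary search; the other two patterns
touch every term. Left truncation could use the same trick on a second list
of reversed terms, at the price of keeping that second list around.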


 HTH RalfD



