[Xapian-discuss] Search queries with wildcards

Wed Dec 15 10:46:32 GMT 2004

On Wed, Dec 15, 2004 at 11:06:45AM +0100, Timo Haberkern wrote:

> >A thought: this is perhaps impractical because of dictionary sourcing
> >issues (and management, too, come to think of it), but you could look
> >for compound prefixes while turning longer words into terms
>
> You are right about that, but the problem is how to detect the acceptable 
> fragments of a compound word. The application has to index technical 
> documents and there are many, many, many (...) compound words that never 
> occur in any dictionary. At the moment I don't see a practical way to 
> solve this problem as you described. Or do you have an idea how to do so?

Not really, I'm afraid :-)

I was thinking that, in the same spirit as stemming not being perfect,
you'd have a dictionary of things that you care about occurring in
compound words, and consider them even when they occur in compounds
you've never seen before. So (in English, as my German's not so hot
:-):

"Code" is a word on our list, so "Codemonkey" would be split and
indexed as "Code" and "monkey". However "monkey" may not be (you could
use a specialist particle list, or generate it automatically), so
"Monkeynuts" wouldn't be split (unless "nuts" was in your particle
list).
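
As a rough illustration of the splitting rule above, here is a minimal
Python sketch. The particle list, the function name and the choice to
lowercase everything are all assumptions made for the example; it only
looks for a particle at the left or right edge of the word.

```python
# Hypothetical particle list: the words we care about inside compounds.
PARTICLES = {"code"}

def split_compound(word, particles=PARTICLES):
    """Return the terms to index for `word`.

    If the word starts or ends with a known particle, split it there;
    otherwise index the word whole. Terms are lowercased (an assumption).
    """
    lower = word.lower()
    # Try longer particles first so the longest match wins.
    for particle in sorted(particles, key=lambda p: (-len(p), p)):
        if len(lower) > len(particle):
            if lower.startswith(particle):
                return [particle, lower[len(particle):]]
            if lower.endswith(particle):
                return [lower[:-len(particle)], particle]
    return [lower]
```

With only "code" on the list, "Codemonkey" is split and "Monkeynuts" is
left alone; adding "nuts" to the list makes "Monkeynuts" split as well.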

Now, you can create a particle list by hand, but that's a pain. If
it's a specialised application (say a search engine for technical
documentation) then you may have some success in generating it from
the database. Do an index pass without splitting, then get all words
out of the database and start looking for terms that occur in other
terms - as someone else pointed out, from a semantic point of view
you're often most interested in the head word, which will usually be
at the left or right of a word. If you want to search them everywhere,
it gets more interesting, and the algorithm is left as an exercise to
the reader :-)
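
For the left-or-right case, the generation pass might look something
like the sketch below. It assumes you already have the full term list
as plain strings; the function name and the two thresholds (min_len,
min_count) are made up for the example and would need tuning.

```python
from collections import Counter

def candidate_particles(terms, min_len=4, min_count=2):
    """Find terms that occur at the left or right edge of other terms.

    A term becomes a candidate particle if it is at least `min_len`
    characters long and is a proper prefix or suffix of at least
    `min_count` other terms.
    """
    vocab = set(terms)
    counts = Counter()
    for term in vocab:
        # Try every split point that leaves both halves non-empty.
        for cut in range(1, len(term)):
            head, tail = term[:cut], term[cut:]
            if len(head) >= min_len and head in vocab:
                counts[head] += 1
            if len(tail) >= min_len and tail in vocab:
                counts[tail] += 1
    return sorted(p for p, n in counts.items() if n >= min_count)
```

The minimum length keeps very short terms from matching everywhere; the
resulting list is still only a set of candidates and will probably want
a look over by hand.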

(If you're lucky enough to work in a language where the head word is
the leftmost particle in the compound word, this becomes a lot easier
and you can do it without ripping all the terms out of the
database. I don't think Germanic languages tend to work like this,
though ... of course, you could always store your terms reversed in
the database.)
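
To show why reversing helps: with reversed terms kept in sorted order,
"all terms ending in X" becomes "all terms starting with reversed X",
which is one contiguous range. A small Python sketch, using a sorted
list as a stand-in for the database's term list (the function names
are hypothetical):

```python
import bisect

def build_reversed_index(terms):
    """Store each term reversed, in sorted order."""
    return sorted(t[::-1] for t in set(terms))

def terms_ending_with(reversed_index, suffix):
    """Return the original terms that end with `suffix`."""
    key = suffix[::-1]
    # All reversed terms starting with `key` sit together in sorted order.
    start = bisect.bisect_left(reversed_index, key)
    found = []
    for rev in reversed_index[start:]:
        if not rev.startswith(key):
            break
        found.append(rev[::-1])
    return found
```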

> I'm not sure if I understood that right. Is the only possible way to 
> implement wildcards to store all possible substrings in the 
> index database? So if I have the word "car" I need to store in the 
> database:
> 
> - "c"
> - "ca"
> - "car"
> 
> for doing a simple "c*" wildcard search?

As I understand it, yes. It's going to screw up your probabilities,
too, and hence your result ranking, because "c" will be quite a common
term ...
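
A small Python sketch of what that indexing scheme amounts to, and why
the short prefixes end up so common. The function names are made up for
the example, and documents are just lists of words here.

```python
from collections import Counter

def prefix_terms(word):
    """All leading prefixes of `word`, shortest first."""
    return [word[:i] for i in range(1, len(word) + 1)]

def document_frequencies(docs):
    """Count how many documents each prefix term appears in.

    `docs` is a list of documents, each a list of words.
    """
    df = Counter()
    for words in docs:
        terms = set()
        for word in words:
            terms.update(prefix_terms(word))
        df.update(terms)  # each term counted once per document
    return df
```

Three one-word documents containing "car", "cat" and "code" give "c" a
document frequency of 3 while "car" stays at 1, which is the skew
mentioned above.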

> Isn't it possible to extend the search module to do a 
> wildcard search over the index database?

I don't know the insides of the matcher, but I'd think it should be
possible. It would need a new Xapian::Query::op type, and a
corresponding PostList type in the matcher, I'd think ... it should be
able to solve the ranking problem, because it would be matching
against the actual term (and hence the wdf etc. would be right).
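
Purely to illustrate the idea (this is not Xapian's API or how its
matcher works), here is a Python sketch that expands the wildcard
against the real terms at query time. The postings structure, the
function names and the "sum the wdf" scoring are all assumptions; the
scoring is a stand-in for proper weighting.

```python
import bisect

def expand_wildcard(sorted_terms, prefix):
    """Return the terms in a sorted term list that start with `prefix`."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

def wildcard_search(postings, prefix):
    """Match documents containing any term that starts with `prefix`.

    `postings` maps term -> {doc_id: wdf}. The score per document is
    the summed wdf of the matching terms (a placeholder for real
    weighting).
    """
    scores = {}
    for term in expand_wildcard(sorted(postings), prefix):
        for doc_id, wdf in postings[term].items():
            scores[doc_id] = scores.get(doc_id, 0) + wdf
    return scores
```

Because only the real terms are stored, nothing like "c" or "ca" is in
the index to distort the statistics.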

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org