[Xapian-discuss] Future of Xapian (long)

Richard Boulton richard@lemurconsulting.com
Mon, 21 Jun 2004 12:30:49 +0100


Thanks for the detailed reply.  I should probably make clear that we 
weren't thinking that many of these ideas would be integrated into the 
core of Xapian - rather, we were thinking that they would be higher 
level components.  These would be part of the Xapian project in the same 
way as the omega and cvssearch projects are.

Francis Irving wrote:
>>b. A web server for Xapian.
> Why?  Sounds like an actually bad idea to me.

I should rephrase this: "An application server providing an interface to 
Xapian over HTTP".  (This would probably also use XML to format the 
returned data.)  The idea is that it would allow a web application 
developer to talk to a Xapian system, possibly running on a remote 
machine, using a standard protocol which is supported in many languages. 
The idea is certainly not that the Xapian web server would be exposed 
directly to the public internet.

The idea comes in part from observing that a lot of commercial search 
engines provide this type of interface.  It would allow easy integration 
of Xapian into a servlet architecture - also popular in (parts of) the 
commercial world.  It would also allow, for example, a PHP developer to 
access a remote Xapian server without requiring a customised PHP 
installation.

Why use HTTP for this?  Mainly because many languages now have support 
for HTTP built in, so integration is extremely easy.
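
To make this concrete, here's a rough sketch of what such a front end 
could look like, using the Python bindings and the standard library HTTP 
server.  The database path, URL layout and XML format are made up purely 
for illustration - they're not a proposal for the actual interface:

import xapian

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
from xml.sax.saxutils import escape

DB_PATH = "/var/lib/xapian/db"   # hypothetical database location

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        querystring = params.get("q", [""])[0]

        db = xapian.Database(DB_PATH)
        parser = xapian.QueryParser()
        parser.set_database(db)
        enquire = xapian.Enquire(db)
        enquire.set_query(parser.parse_query(querystring))
        mset = enquire.get_mset(0, 10)

        # Format the matches as a small XML document.
        lines = ['<?xml version="1.0"?>', '<results>']
        for match in mset:
            data = match.document.get_data()
            if isinstance(data, bytes):
                data = data.decode("utf-8", "replace")
            lines.append('  <hit docid="%d" percent="%d">%s</hit>'
                         % (match.docid, match.percent, escape(data)))
        lines.append('</results>')
        body = "\n".join(lines).encode("utf-8")

        self.send_response(200)
        self.send_header("Content-Type",
                         "application/xml; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8088), SearchHandler).serve_forever()

Any language with an HTTP client could then just fetch something like 
/search?q=xapian and parse the XML it gets back, with no Xapian bindings 
installed locally.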

>>c. A summarizer/highlighter component; we've noticed that TheyWorkForYou.org
>>have this already but we also have some code to do this.
> 
> Yes we do.  This is relatively straightforward with a simple search,
> but much harder with stemming (which we don't do at the moment).
> 
> QueryParser may be one place to put this, although it would be good
> to be able to do it with any query.

I don't think it's a good idea to merge it with the QueryParser - I 
can't think of any code that they share, and they do quite different 
things.

> Would like two functions:
> 
> 1. Takes a document, a query and a required excerpt length.  Function
> returns a suggested place for excerpt to begin and end (not breaking
> words in half).  I talked to Olly about this in the pub the other
> week.  It would scan for the window containing the largest bulk of
> relevant terms.  This means that if you have several words together
> at the end of the document, that would be returned, rather than one
> word at the start.

It would also be reasonable to return an excerpt made from two or more 
parts of the document, if the requested excerpt length is long enough.
This would work by splitting the document into phrases (by observing 
punctuation), and scanning for the phrase with the largest number of 
relevant terms - or possibly the largest weighted sum of terms.  This is 
how the code we currently have works.
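
For what it's worth, a rough sketch of that phrase-scoring approach in 
Python (not our actual code, and ignoring term weighting) might look 
like this:

import re

def best_excerpt(text, query_terms, max_length=200):
    terms = {t.lower() for t in query_terms}
    # Split into phrases by observing punctuation.
    phrases = re.split(r"[.!?;:\n]+", text)

    def score(phrase):
        words = re.findall(r"\w+", phrase.lower())
        # Could be a weighted sum of term weights instead of a count.
        return sum(1 for w in words if w in terms)

    best = max(phrases, key=score, default="").strip()
    if len(best) <= max_length:
        return best
    # Trim at a word boundary rather than breaking a word in half.
    cut = best.rfind(" ", 0, max_length)
    return best[:cut if cut > 0 else max_length] + "..."

Returning two or more parts would just mean keeping the top few phrases 
by score instead of only the best one.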

> 2. Takes a document, a query and a highlight prefix/suffix.  Returns
> the document with highlighting.  Bonus feature is different colours
> for different search terms (like the Google cache does).

Adding stemming support for this isn't too hard - it just requires 
parsing each word, stemming it, and checking if it is the same as a 
stemmed query term.
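
A quick sketch of what I mean, using xapian.Stem from the Python 
bindings (the prefix/suffix defaults here are just placeholders, and the 
helper copes with bindings which hand back bytes rather than strings):

import re
import xapian

def _stem(stemmer, word):
    s = stemmer(word.lower())
    # Some versions of the bindings return bytes rather than str.
    return s.decode("utf-8") if isinstance(s, bytes) else s

def highlight(text, query_terms, prefix="<b>", suffix="</b>",
              lang="english"):
    stemmer = xapian.Stem(lang)
    wanted = {_stem(stemmer, t) for t in query_terms}

    def mark(match):
        word = match.group(0)
        if _stem(stemmer, word) in wanted:
            return prefix + word + suffix
        return word

    # Substitute word by word, leaving punctuation and whitespace
    # untouched.
    return re.sub(r"\w+", mark, text)

Different colours per term would just mean choosing the prefix/suffix 
based on which query term the stem matched.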

>>d. A spellchecker (like Google's 'did you mean xxx') using edit distance
>>calculation.
> 
> That would be nice to have.  How hard is that to do?  Would it be
> fast?  What extra information do you index to do it?

It's not terribly hard, though it would probably need a lot of tuning to 
work efficiently on a 400,000+ document database.  The simplest method 
is to store a database of n-grams (n=3, probably) for each term in the 
database, and use this to produce a preliminary set of possible 
spellings.  Then, take the top 100 or so and do an edit distance 
calculation on each to pick the best suggestion.  This requires quite a 
lot of extra storage, of course.

I first heard this technique described by someone who had implemented 
it for a database of mathematical articles - he had around 100,000 
documents and had got it working acceptably fast (whatever that means), 
so I think it's a viable technique.
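
To illustrate, here's a rough sketch of the trigram-plus-edit-distance 
idea in Python - purely illustrative, and in practice the trigram index 
would live on disk alongside the term database rather than in memory:

from collections import defaultdict

def trigrams(term):
    padded = "$" + term + "$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_index(terms):
    # Map each trigram to the set of terms containing it.
    index = defaultdict(set)
    for term in terms:
        for gram in trigrams(term):
            index[gram].add(term)
    return index

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(word, index, shortlist=100):
    # Shortlist terms sharing the most trigrams with the misspelling,
    # then rank the shortlist by edit distance.
    counts = defaultdict(int)
    for gram in trigrams(word):
        for term in index.get(gram, ()):
            counts[term] += 1
    candidates = sorted(counts, key=counts.get, reverse=True)[:shortlist]
    return min(candidates, key=lambda t: edit_distance(word, t),
               default=word)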

>>e. A web spider
> 
> Why?  The main benefit of Xapian is that it doesn't spider, but can be
> used to search structured data in a database.  I haven't looked at
> Omega, but a separate project like that seems more the place to put a
> spider.  Not as part of Xapian itself.

Indeed, not as part of Xapian itself.  What we were wondering was how 
many people would find it useful to have an application available which 
performed web spidering, and stored the results in a database.  (My 
suspicion is that few people on the Xapian list would use one, but 
writing a spider would bring in a lot of new Xapian users.)

>>f. An easy(ier) way of plugging in the various open source file format
>>converters, for indexing Ms Office and other formats, with a list of which
>>ones actually work!
> 
> Again I see this as being higher level than the core Xapian
> information retrieval API.

Agreed.

-- 
Richard