[Xapian-discuss] Future of Xapian (long)
Richard Boulton
richard@lemurconsulting.com
Mon, 21 Jun 2004 12:30:49 +0100
Thanks for the detailed reply. I should probably make clear that we
weren't thinking that many of these ideas would be integrated into the
core of Xapian - rather, we were thinking that they would be higher
level components. These would be part of the Xapian project in the same
way as the omega and cvssearch projects are.
Francis Irving wrote:
>>b. A web server for Xapian.
> Why? Sounds like an actually bad idea to me.
I should rephrase this: "An application server providing an interface to
Xapian over HTTP". (This would probably also use XML to format the
returned data.) The idea is that it lets a web application developer
talk to a Xapian system, possibly running on a remote machine, using a
standard protocol supported in many languages. It is certainly not that
the Xapian webserver would be exposed directly to the public internet.
The idea comes in part from observing that a lot of commercial search
engines provide this type of interface. It would allow easy integration
of Xapian into a servlet architecture - also popular in (parts of) the
commercial world. It would also allow, for example, a PHP developer to
access a remote Xapian server without requiring a customised PHP
installation.
Why use HTTP for this? Mainly because many languages now have support
for HTTP built in, so integration is extremely easy.
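To make the idea concrete, here is a sketch of what a client consuming
such a service might look like. The URL in the comment, the query
parameters, and every XML element name below are invented purely for
illustration; Xapian defines no such protocol today.

```python
# Sketch of a client consuming results from a hypothetical Xapian
# application server speaking XML over HTTP.  All names here are
# invented for illustration.
import xml.etree.ElementTree as ET

# In practice this document would be fetched with an ordinary HTTP GET,
# e.g. urllib.request.urlopen("http://searchhost:8080/search?q=tax")
sample_response = """\
<results total="42">
  <hit rank="1" docid="1234" percent="98">
    <field name="title">Finance Bill debate</field>
  </hit>
</results>"""

root = ET.fromstring(sample_response)
total = int(root.get("total"))
hits = [(h.get("docid"), h.find("field[@name='title']").text)
        for h in root.findall("hit")]
print(total, hits)  # 42 [('1234', 'Finance Bill debate')]
```

The point is that nothing here is Xapian-specific on the client side -
any language with an HTTP client and an XML parser could do this.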
>>c. A summarizer/highlighter component; we've noticed that TheyWorkForYou.org
>>have this already but we also have some code to do this.
>
> Yes we do. This is relatively straightforward with a simple search,
> but much harder with stemming (which we don't do at the moment).
>
> QueryParser may be one place to put this, although would be good to
> be able to do it with any query.
I don't think it's a good idea to merge it with the QueryParser - I
can't think of any code that they share, and they do quite different
things.
> Would like two functions:
>
> 1. Takes a document, a query and a required excerpt length. Function
> returns a suggested place for excerpt to begin and end (not breaking
> words in half). I talked to Olly about this in the pub the other
> week. It would scan for the window containing the largest bulk of
> relevant terms. This means that if you have several words together
> at the end of the document, that would be returned, rather than one
> word at the start.
It would also be reasonable to return an excerpt made from two or more
parts of the document, if the requested excerpt length is long enough.
This would work by splitting the document into phrases (by observing
punctuation), and scanning for the phrase with the largest number of
relevant terms - or possibly the largest weighted sum of terms. This is
how the code we currently have works.
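A rough sketch of that phrase-scoring approach, in Python. The
punctuation-based splitting and the simple term-count score are crude
stand-ins for what real code would do (no stemming, no term weighting),
and the function names are mine, not from our existing code:

```python
import re

def best_excerpt(text, query_terms, max_len=200):
    """Pick the punctuation-delimited phrase containing the most query
    terms - a simplified sketch of the scanning approach described
    above, using a plain term count rather than a weighted sum."""
    phrases = re.split(r"[.;:!?]+", text)
    terms = {t.lower() for t in query_terms}

    def score(phrase):
        words = re.findall(r"\w+", phrase.lower())
        return sum(1 for w in words if w in terms)

    best = max(phrases, key=score)
    return best.strip()[:max_len]

print(best_excerpt(
    "The weather was dull. The member spoke about tax and tax credits.",
    ["tax", "credits"]))
```

Extending this to stitch together the two or three best phrases, rather
than just the single best one, is straightforward once the phrases are
scored.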
> 2. Takes a document, a query and a highlight prefix/suffix. Returns
> the document with highlighting. Bonus feature is different colours
> for different search terms (like the Google cache does).
Adding stemming support for this isn't too hard - it just requires
parsing each word, stemming it, and checking if it is the same as a
stemmed query term.
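Something along these lines, say. The toy suffix-stripping stemmer
below is just a placeholder so the example is self-contained; real code
would use Xapian's own stemmer so that document words are stemmed
exactly as the query terms were at indexing time:

```python
import re

def toy_stem(word):
    # Placeholder stemmer for illustration only; real code would use
    # Xapian's stemmer so document and query stemming agree.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def highlight(text, query_terms, prefix="<b>", suffix="</b>"):
    """Wrap every word whose stem matches a stemmed query term."""
    stemmed_query = {toy_stem(t.lower()) for t in query_terms}

    def mark(match):
        word = match.group(0)
        if toy_stem(word.lower()) in stemmed_query:
            return prefix + word + suffix
        return word

    return re.sub(r"\w+", mark, text)

print(highlight("He walked while walking", ["walking"]))
# He <b>walked</b> while <b>walking</b>
```

The per-term-colour bonus would just mean choosing the prefix/suffix
pair from a table keyed on the stemmed term instead of using one pair.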
>>d. A spellchecker (like Google's 'did you mean xxx') using edit distance
>>calculation.
>
> That would be nice to have. How hard is that to do? Would it be
> fast? What extra information do you index to do it?
It's not terribly hard, though it would probably need a lot of tuning to
work efficiently on a 400,000+ document database. The simplest method
is to store a database of n-grams (n=3, probably) for each term in the
database, and use this to produce a preliminary set of possible
spellings. Then, take the top 100 or so and run an edit distance
calculation on each to pick the best spelling. This requires quite a
lot of extra storage, of course.
I first heard of this technique being described by someone who had
implemented it for a database of mathematical articles - he had around
100,000 documents, and had got it working acceptably fast, whatever that
means, so I think it's a viable technique.
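A minimal sketch of the two-stage approach, to show the shape of it.
The padding scheme and the 100-candidate cutoff are arbitrary choices
for illustration, and a real implementation would keep the trigram
index in the database rather than recomputing it per query:

```python
def ngrams(term, n=3):
    # Pad the term so its prefix and suffix also form n-grams.
    padded = f"  {term} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def edit_distance(a, b):
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def suggest(word, lexicon, candidates=100):
    # Stage 1: cheap preliminary filter by trigram overlap.
    grams = ngrams(word)
    shortlist = sorted(lexicon,
                       key=lambda t: len(grams & ngrams(t)),
                       reverse=True)[:candidates]
    # Stage 2: expensive edit distance, over the shortlist only.
    return min(shortlist, key=lambda t: edit_distance(word, t))

print(suggest("parliment", ["parliament", "apartment", "treatment"]))
# parliament
```

The trigram filter is what makes this viable at scale - the quadratic
edit distance is only ever run on a fixed-size shortlist.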
>>e. A web spider
>
> Why? The main benefit of Xapian is that it doesn't spider, but can be
> used to search structured data in a database. I haven't looked at
> Omega, but a separate project like that seems more the place to put a
> spider. Not as part of Xapian itself.
Indeed, not as part of Xapian itself. What we were wondering was how
many people would find it useful to have an application available which
performed web spidering, and stored the results in a database. (My
suspicion is that few people on the Xapian list would use one, but
writing a spider would bring in a lot of new Xapian users.)
>>f. An easy(ier) way of plugging in the various open source file format
>>converters, for indexing Ms Office and other formats, with a list of which
>>ones actually work!
>
> Again I see this as being higher level than the core Xapian
> information retrieval API.
Agreed.
--
Richard