[Xapian-discuss] Rqt for Features

Richard Boulton richard at tartarus.org
Fri Jul 9 16:52:02 BST 2004


On Fri, Jul 09, 2004 at 03:59:16PM +0100, Tim Brody wrote:
> I've been working on a xapian implementation for the last month or so and
> have implemented (well, hacked until it worked) QueryParser for Perl.

I see these have been added to the Search::Xapian module on CPAN.  Thanks
for your work.

> Could xapian have the ability to specify docids? My system - as I'm sure
> many others do - maintains its own ids for people, docs etc. For the moment
> I've opted to rebuild the index from scratch everyday, rather than
> maintaining a docid => myid mapping in order to perform incremental nightly
> changes.

With Xapian as it currently stands, the way to do this is to give each
document a unique term derived from your own identifier.  The unique term
consists of a prefix followed by your document identifier.
Traditionally, "Q" has been used for the prefix, but any prefix which
will avoid collisions with other terms is acceptable.

Whenever a document is modified, you would first open the postlist for the
term, which gives you a list of all documents containing the term, and
delete these documents (hopefully, this list would be of length 0 or 1).
Then, add the new document.
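
As a rough sketch of the steps above: the ToyIndex class below is a
hypothetical in-memory stand-in for a term index, not the Xapian API, and
replace_by_external_id is an illustrative helper name.  It only shows the
"look up the unique term, delete what it matches, then add" pattern.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for a term index; NOT the Xapian API."""

    def __init__(self):
        self.docs = {}                     # internal docid -> (terms, data)
        self.postlists = defaultdict(set)  # term -> set of internal docids
        self.next_docid = 1

    def add_document(self, terms, data):
        """Store a document and return its newly assigned internal docid."""
        docid = self.next_docid
        self.next_docid += 1
        terms = set(terms)
        self.docs[docid] = (terms, data)
        for term in terms:
            self.postlists[term].add(docid)
        return docid

    def delete_document(self, docid):
        """Remove a document and its entries from every postlist."""
        terms, _ = self.docs.pop(docid)
        for term in terms:
            self.postlists[term].discard(docid)

    def postlist(self, term):
        """Return the sorted internal docids of documents containing term."""
        return sorted(self.postlists.get(term, ()))


def replace_by_external_id(index, external_id, terms, data, prefix="Q"):
    """Delete any document carrying the unique term, then add the new one."""
    unique_term = prefix + str(external_id)
    # Hopefully this list has length 0 or 1.
    for docid in index.postlist(unique_term):
        index.delete_document(docid)
    return index.add_document(set(terms) | {unique_term}, data)
```

Calling replace_by_external_id twice with the same external identifier
leaves exactly one document carrying that unique term, although its
internal docid changes each time.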

There is a proposal to add a new API method to delete all documents
containing a given term, which would ease the implementation of this scheme
(I'm not sure of the status of this proposal).

This method is used by "scriptindex" - see the implementation of the "uniq"
command.  It can be more flexible and robust than using Xapian's document
identifiers: in particular, it allows you to use any string as an
identifier, rather than restricting you to 32-bit numbers.  Also, if you
combine databases together using Xapian's multidatabase facility, the
Xapian docids will change (an interleaving scheme is used to disambiguate
document identifiers), which could break software which relies on
the document identifiers being tied more closely to the contents of the
document.
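
To illustrate why the docids change, here is a minimal sketch of one
possible interleaving scheme.  The exact formula is an assumption made for
illustration (docids start at 1, sub-databases are numbered from 0), and
the function names are hypothetical.

```python
def combined_docid(sub_docid, db_index, num_dbs):
    """Map a docid in sub-database db_index (0-based) to a combined docid.

    Assumed scheme: docids from the sub-databases are interleaved, so with
    two databases the combined docids alternate between them.
    """
    return (sub_docid - 1) * num_dbs + db_index + 1


def split_docid(combined, num_dbs):
    """Inverse mapping: return (db_index, sub_docid) for a combined docid."""
    return (combined - 1) % num_dbs, (combined - 1) // num_dbs + 1
```

Under this assumed scheme, the same underlying document gets a different
combined docid whenever the number or order of the databases changes,
whereas a unique term stored in the document stays the same.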

Using Xapian's built-in identifiers should be more efficient, allowing
documents to be referenced without having to perform a database lookup to
determine the internal identifier first, but I'm not sure how much of a
cost this actually incurs.

> The cleanest method from outside of the API would be if replace_document
> accepted a non-existent (to xapian) docid, in which case it adds the
> document rather than excepting (i.e. SQL's "REPLACE" behaviour).

This would be a perfectly reasonable extension.  I haven't time right now
to take a look at how feasible it would be, but I can't think of any likely
problem.
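
For clarity, the difference between the current and the requested behaviour
can be sketched with a plain dictionary standing in for the database.  Both
function names are hypothetical, and this is not the Xapian API.

```python
def replace_strict(store, docid, document):
    """Replace an existing document; raise if the docid is unknown."""
    if docid not in store:
        raise KeyError(f"no document with docid {docid}")
    store[docid] = document


def replace_or_add(store, docid, document):
    """Replace the document at docid, or add it if the docid is unused.

    Returns True if an existing document was replaced, False if added.
    """
    existed = docid in store
    store[docid] = document
    return existed
```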

Could you file a bug report for this feature request, so that it doesn't
get lost?

-- 
Richard


