[Xapian-discuss] Incremental indexing

Jean-Francois Dockes jf at dockes.org
Tue Mar 20 09:36:15 GMT 2012


Marios Titas writes:
 > Hi all,
 > 
 > I am trying to implement an Incremental indexing scheme. The problem
 > is that usually the modified documents are large but the modifications
 > are limited. Ideally, I would like to reindex only the modified parts
 > of these documents. If I am not mistaken, xapian cannot do that. Are
 > there any other approaches?
 > 
 > It would be nice if xapian supported something like the SQL "group
 > by". If it did, then it would be possible to break large documents
 > into several pieces which could be indexed independently. When
 > querying, these pieces would be then combined again using some
 > aggregate function similar to the SQL function sum.

Hi,

The Recoll Xapian-based desktop indexer implements the "break into pieces"
part for big text files. This is done so that the appropriate section of
the document can be loaded for previewing (useful for, e.g., big log files).

It doesn't implement independent incremental re-indexing, though, because
it has no way to know which parts may have changed.

The document parts are linked by a common parent identifier which can be
used to get to the whole document. The identifier is stored in two places:
as an entry in the document data record, used to get from a result
document to its parent, and as a "parent" unique term on each part, used to
find all the parts of a given parent document (useful for deleting them,
for example).
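
To make the linking scheme concrete, here is a minimal sketch in Python. It
is not Recoll's code and does not use Xapian at all: TinyIndex is a
hypothetical in-memory stand-in for the database, and the "XPARENT:" prefix,
the data record fields and the index_in_parts() helper are names invented
for the illustration. It only shows the two links described above: the
parent entry in the data record and the shared parent term.

```python
from collections import defaultdict

# Hypothetical prefix for the "parent" term; not Recoll's actual prefix.
PARENT_PREFIX = "XPARENT:"


class TinyIndex:
    """In-memory stand-in for a term-indexed document store (assumption)."""

    def __init__(self):
        self._docs = {}                    # docid -> (data record, terms)
        self._postings = defaultdict(set)  # term -> set of docids
        self._next_id = 1

    def add(self, data, terms):
        docid = self._next_id
        self._next_id += 1
        self._docs[docid] = (data, set(terms))
        for term in terms:
            self._postings[term].add(docid)
        return docid

    def data(self, docid):
        return self._docs[docid][0]

    def docs_with_term(self, term):
        return sorted(self._postings.get(term, ()))

    def delete_by_term(self, term):
        # Remove every document carrying the term, e.g. all parts of a parent.
        for docid in list(self._postings.get(term, ())):
            _, terms = self._docs.pop(docid)
            for t in terms:
                self._postings[t].discard(docid)


def index_in_parts(index, parent_id, text, lines_per_part):
    """Split text into groups of lines and index each group as one part."""
    lines = text.splitlines()
    docids = []
    for first in range(0, len(lines), lines_per_part):
        chunk = lines[first:first + lines_per_part]
        # Link 1: the data record names the parent (and where the part starts).
        data = {"parent": parent_id, "first_line": first}
        # Link 2: every part carries the same parent term.
        terms = {w.lower() for line in chunk for w in line.split()}
        terms.add(PARENT_PREFIX + parent_id)
        docids.append(index.add(data, terms))
    return docids


if __name__ == "__main__":
    index = TinyIndex()
    log = "\n".join(f"line {i} event" for i in range(10))
    index_in_parts(index, "/var/log/big.log", log, lines_per_part=4)
    hit = index.docs_with_term("7")[0]
    print(index.data(hit))                       # part holding "line 7 event"
    index.delete_by_term(PARENT_PREFIX + "/var/log/big.log")
    print(index.docs_with_term("7"))             # all parts are gone
```

With a real Xapian database the same idea would presumably map to
Document.set_data() for the record, Document.add_boolean_term() for the
parent term, and WritableDatabase.delete_document(term) to drop all the
parts at once.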

In Recoll, this is just a use case of the general mechanism describing
document embedding, and a bit complicated. I imagine that this could be
implemented in different ways.

Cheers,

J.F. Dockes