[Xapian-discuss] Re: Evaluating Xapian
Arne Georg Gleditsch
argggh at linpro.no
Fri Jan 28 19:56:26 GMT 2005
* Olly Betts
> That's how you'd do that.
>
> It shouldn't be too bad - it necessarily has to rewrite that document's
> entry in the termlist table, and modify the posting list for NEW_TAG.
>
> Currently it will also needlessly rewrite an unchanged block of the
> postlist table for each unchanged term indexing the document, another
> block for the record table, and another for values (if the document has
> any). The positionlist table may get several blocks rewritten (if
> you're indexing with positional information) depending how long the
> documents are.
>
> This rewriting of unchanged blocks could be optimised out. Much of the
> machinery needed is in place (Xapian::Document reads information from
> disk lazily, so it's easy to tell if someone is writing back the same
> document with unchanged data and values, for example).
>
> I've not implemented this so far simply because it's not a hot spot for
> most users!
>
> Are you doing this a lot?
Well, I'm fiddling with using Xapian for a source-code indexing system
where I want to index several releases of the same source code base
(the Linux kernel, primarily). Where the same file exists in several
releases in an identical revision (which is true for a lot of files,
especially in a stable branch), I'd like to index this [file,revision]
only once. So I'm tagging the indexed documents with the releases
they occur in, incrementally adding tags as I index new releases.
It's not a performance-critical part of the system, but it seems to be
slower than it needs to be. I get the impression that it's actually
slower than indexing a clean tree. (I will try to do a more useful
performance study; for now I'm just trying to rule out stupid
usage-pattern errors.) Does replace_document cause an implicit flush of
the database?
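
To make the usage pattern concrete, here is a rough sketch of the tagging
scheme in plain Python. It makes no Xapian calls: it only models the
bookkeeping, where each [file,revision] pair gets its full text indexed once
and later releases merely add a tag. The class and method names, the counters
and the revision numbers are all made up for illustration. In Xapian terms the
tag-only branch would presumably correspond to fetching the document, adding
the release term and calling replace_document, but that mapping is an
assumption and is not exercised here.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedFile:
    """One indexed [file,revision] plus the releases it occurs in."""
    text: str
    releases: set = field(default_factory=set)


class ReleaseIndex:
    """Hypothetical in-memory model of the scheme; not a Xapian wrapper."""

    def __init__(self):
        self.docs = {}          # (path, revision) -> IndexedFile
        self.full_indexes = 0   # times a file's text was indexed from scratch
        self.tag_updates = 0    # times only a release tag was added

    def add(self, release, path, revision, text):
        key = (path, revision)
        doc = self.docs.get(key)
        if doc is None:
            # First time this revision is seen: index the text and tag it.
            self.docs[key] = IndexedFile(text, {release})
            self.full_indexes += 1
        elif release not in doc.releases:
            # Same revision in a new release: only the tag set changes.
            doc.releases.add(release)
            self.tag_updates += 1

    def files_in(self, release):
        """Paths of all files tagged with the given release, sorted."""
        return sorted(path for (path, _rev), doc in self.docs.items()
                      if release in doc.releases)


idx = ReleaseIndex()
idx.add("2.6.9", "kernel/sched.c", "1.10", "scheduler source")
idx.add("2.6.10", "kernel/sched.c", "1.10", "scheduler source")  # tag only
idx.add("2.6.10", "kernel/fork.c", "1.4", "fork source")
print(idx.full_indexes, idx.tag_updates, idx.files_in("2.6.10"))
```

The tag-only branch is the one in question here: it changes a single term on
an existing document, so one would expect it to be cheaper than the first
branch.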
Arne.