[Xapian-discuss] XAPIAN_FLUSH_THRESHOLD
Olly Betts
olly at survex.com
Tue Jun 12 00:00:38 BST 2007
On Mon, Jun 11, 2007 at 02:42:33PM -0700, Kevin Duraj wrote:
> I have been using XAPIAN_FLUSH_THRESHOLD to index 10 million
> documents for over 6 months and it worked fine and fast until Xapian
> version 1.0. It used to take 50 minutes to index 10 million documents.
> After installing Xapian 1.0.0 ... 10 million documents now take approx
> 16 hours to index. I was looking for bugs in my code but saw that very
> little memory was being used even when the threshold was set to 10 million.
This doesn't make sense to me. Yes, the compression will use more CPU
time, but it shouldn't make much difference to the process size.
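For anyone following along: XAPIAN_FLUSH_THRESHOLD is just an environment
variable which Xapian reads to decide how many documents to batch up
between automatic flushes, so it gets set in the indexer's environment,
something like this (the indexer name and arguments are only placeholders):

  export XAPIAN_FLUSH_THRESHOLD=10000000
  ./my_indexer /path/to/database documents.dump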
A lot of things changed in Xapian 1.0.0, not just compression. So it
would be useful if you could profile so we can actually see where the
extra time is spent, rather than just guessing.
If you're on Linux, the best tool seems to be oprofile, because it
samples the whole system (kernel and userland). I believe a 2.6
kernel is needed for best results - just run your indexer under oprofile
with the callgraph enabled (call opcontrol with --callgraph=12 or
some suitable stack depth, then run opreport with --callgraph).
This should show exactly where the extra time is being spent.
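A rough recipe (from memory, so check the oprofile documentation; you need
to be root, and --no-vmlinux just means you don't have an uncompressed
kernel image for it to resolve kernel symbols against):

  opcontrol --init
  opcontrol --no-vmlinux --callgraph=12
  opcontrol --start
  ./my_indexer ...        # your usual indexing run (placeholder name)
  opcontrol --stop
  opreport --callgraph > profile.txt

Then post the most expensive-looking entries from profile.txt and we can
see what has changed.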
> I have installed Xapian 1.0.1 and it seems to be using more memory,
> which is good.
There weren't any relevant changes to flint between 1.0.0 and 1.0.1
so it seems unlikely you'd see any real difference.
> What might be large for you is small for others. I want to be
> able to index 1 billion documents in a reasonable time.
Incidentally, you'll probably get there fastest by building a number of
smaller databases and merging them using xapian-compact.
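Something like this, for example (database and input names are just
placeholders, and from memory the sources come first and the destination
is the last argument - check "xapian-compact --help"):

  ./my_indexer db-part1 part1.dump
  ./my_indexer db-part2 part2.dump
  xapian-compact db-part1 db-part2 db-merged

Each partial database stays small enough to build quickly, and you only
pay for the merge once at the end.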
> Either Xapian 1.0 does not take the threshold into account, or the
> compression that was introduced takes too much time. We need an option,
> via an environment variable, to disable compression.
Perhaps. More likely the thresholds at which to use the compression
just need tuning, as I've said before.
The most obvious value to play with is COMPRESS_MIN in
backends/flint/flint_table.cc, which is currently 4. I've not tried
experimenting with different values, so it would be interesting to
see some real-world benchmarks.
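If you want to try it, the experiment is roughly this (it assumes you're
building from an unpacked source tree, and the exact definition of the
constant may not match this description word for word):

  cd xapian-core-1.0.1
  # edit backends/flint/flint_table.cc and change COMPRESS_MIN from 4
  # to a larger value, e.g. 100, then rebuild and reinstall:
  make && make install

and then rerun your indexing run with each value and compare the timings.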
> - I do not care how large the index is, or that compression reduces the
> size.
> - I care how much time it takes to index 10-100 million documents
> per index.
There is a connection between the two though. Up to a point,
compression will increase indexing speed, because disks are slow
compared to CPUs and RAM.
Cheers,
Olly