[Xapian-discuss] Incremental indexing limitations
Olly Betts
olly at survex.com
Sun Oct 14 05:27:24 BST 2007
On Fri, Oct 12, 2007 at 02:34:37PM +0200, Ron Kass wrote:
> Olly Betts wrote:
> >It might be interesting to see the output of
> >"ls -l" on the database directory.
> >
> -rw------- 1 apache apache 0 Oct 12 00:09 flintlock
> -rw-r--r-- 1 apache apache 12 Oct 12 10:11 iamflint
> -rw-r--r-- 1 apache apache 103204 Oct 12 10:02 position.baseA
> -rw-r--r-- 1 apache apache 6853787648 Oct 12 10:11 position.DB
> -rw-r--r-- 1 apache apache 114206 Oct 12 10:02 postlist.baseA
> -rw-r--r-- 1 apache apache 113034 Oct 12 09:48 postlist.baseB
> -rw-r--r-- 1 apache apache 7483244544 Oct 12 10:02 postlist.DB
> -rw-r--r-- 1 apache apache 2095 Oct 12 10:02 record.baseA
> -rw-r--r-- 1 apache apache 138035200 Oct 12 10:11 record.DB
> -rw-r--r-- 1 apache apache 46899 Oct 12 10:02 termlist.baseA
> -rw-r--r-- 1 apache apache 3114835968 Oct 12 10:11 termlist.DB
> -rw-r--r-- 1 apache apache 16588 Oct 12 10:02 value.baseA
> -rw-r--r-- 1 apache apache 1102151680 Oct 12 10:11 value.DB
>
> This is for:
> number of documents = 6499500
> average document length = 202.329
Nothing very surprising there. The postlist.DB seems larger than I'd
expect, but that may be down to Hebrew terms and UTF-8 encoding. Also,
from what you report below it compacts a lot (assuming the above are
uncompacted numbers).
> Here is a sample compress (from an earlier stage of the data):
> postlist: Reduced by 56.0786% 2661936K (4746792K -> 2084856K)
> record: Reduced by 2.94789% 2272K (77072K -> 74800K)
> termlist: Reduced by 5.68267% 97560K (1716800K -> 1619240K)
> position: Reduced by 2.5952% 104056K (4009552K -> 3905496K)
> value: Reduced by 50.6533% 300840K (593920K -> 293080K)
I'm surprised the value table has so much slack in it, but it's
comparatively small anyway.
> Note, however, that such a compress takes a lot of time and could
> hardly be run daily, as it would probably take more than a day to run
> when it's big (i.e. when it's 100M documents).
I'd be surprised if it took that long. The gmane database of ~50M
documents is 116GB and takes ~2h20m to compact. That machine has 3GB
of RAM and an Athlon 64 3000+, so it's not ridiculously beefy.
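(For reference, that per-table "Reduced by ..." output looks like
xapian-compact's, so a run like the quoted one is just something along
these lines, with example paths:

    xapian-compact /path/to/db /path/to/db.compact

after which the compacted copy can be swapped in for the original.)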
> Documents are actually added with replace_document_by_term. This is done
> so that unique external document IDs are preserved.
Are there document duplicates during the initial build of the database?
If not, you may find it faster to have a special "build from scratch"
mode which just uses add_document() and then switches to
replace_document_by_term() once things are running incrementally. I've
not profiled this though - it's possible the overhead of checking is
too small to matter.
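Roughly something like this (an untested C++ sketch - the bootstrap
flag and the index_one() wrapper are made-up names; in the C++ API the
"by term" form is the replace_document() overload taking a unique
term):

    #include <xapian.h>
    #include <string>

    // During the initial bulk build we know there are no duplicates,
    // so we can skip the unique-term lookup which the replace form
    // has to do for every document.
    void index_one(Xapian::WritableDatabase &db,
                   const Xapian::Document &doc,
                   const std::string &unique_id_term,
                   bool bootstrap)
    {
        if (bootstrap) {
            // Initial build: plain append, no lookup.
            db.add_document(doc);
        } else {
            // Incremental mode: replace any document already indexed
            // with the same unique ID term (or add it if absent).
            db.replace_document(unique_id_term, doc);
        }
    }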
> Before, we used replace_document and specified docIDs explicitly, but
> since merging doesn't preserve the docIDs, we switched to the approach
> stated above.
Specifying docids out of sequence will also be slower. The case of
sequentially appending new documents (which you'll be mostly doing even
with replace_document_by_term() I imagine) is well optimised, and likely
to improve further in the future.
> >I'm also working on a replacement Btree manager which reduces the
> >amount of disk space which the "wood" of the Btree takes. That's
> >particularly an issue for the positions table which contains a lot of
> >items, many of which are very small.
>
> It would be interesting to test it and see the effect on data like ours.
Unfortunately it's not currently a drop-in replacement. I've been
developing it as a standalone bit of code so far so I can easily test it
as I go along. All the basic stuff works, but it doesn't yet support
splitting up large tags or versioning the Btree.
> We found that setting the auto threshold to a very high number and
> then deciding on the flush time in-code is somewhat smarter.
> This is mainly so we can flush data not only on the number of
> documents added, but also after a specific time (so that documents
> don't wait too long to show up in the index).
Note that you can just set XAPIAN_FLUSH_THRESHOLD, and still call
flush() explicitly after a specific time. The explicit flush() will
reset the counter for the next automatic one.
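In code that might look like this (an untested sketch; FLUSH_INTERVAL
and last_flush are made-up names, with XAPIAN_FLUSH_THRESHOLD set to a
large value in the indexer's environment):

    #include <xapian.h>
    #include <ctime>

    // Made-up value: flush at least every 5 minutes so new documents
    // don't wait too long to become searchable.
    const time_t FLUSH_INTERVAL = 300;

    void maybe_flush(Xapian::WritableDatabase &db, time_t &last_flush)
    {
        if (time(NULL) - last_flush >= FLUSH_INTERVAL) {
            // The explicit flush() also resets the counter for the
            // next automatic (threshold-based) flush.
            db.flush();
            last_flush = time(NULL);
        }
    }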
> >>3. If it takes 2.5 minutes to flush 10K documents when the DB is 3M
> >>documents, does that actually mean that when the database is 100M
> >>documents, each flush will take over an hour?
> >
> >It's not a linear relationship.
>
> While not exactly linear, we do see a steady rise...
It'll be slower with more data, but in general it won't be twice as slow
with twice as much data, so you can't simply extrapolate linearly.
You'll also probably see some cliffs and plateaus in the graph as the
working set stops fitting in the CPU cache, the disk cache, the OS VM
cache, etc.
> On the other hand, it seems that the actual indexing of documents has
> been getting speedier recently, maybe because the more data there is
> in the database, the fewer new terms appear and need to be added to
> the 'dictionary'?
It may be slightly more expensive to add a term for the first time,
though I'd be surprised if it were noticeable like this. I'm not sure
I can think of a better explanation though.
Cheers,
Olly