[Xapian-discuss] Incremental indexing limitations

Ron Kass ron at pidgintech.com
Fri Oct 12 13:34:37 BST 2007


Olly Betts wrote:

> On Thu, Oct 11, 2007 at 06:51:50PM +0200, Ron Kass wrote:
>   
>> 1. Size of such a database will actually be about 3K per document. That 
>> is bigger than the text of the documents themselves. This is quite 
>> surprising really, as what Xapian is supposed to be storing is just the 
>> term IDs and the positions. Any ideas why it's bigger than the original
>> size, as opposed to, say, 1/3 of the size?
>>     
>
> It's usually smaller, but it does depend on the nature of the source
> data.  If you're indexing with positional information, I wouldn't expect
> to achieve 1/3 the size.  It might be interesting to see the output of
> "ls -l" on the database directory.
>   
-rw------- 1 apache apache          0 Oct 12 00:09 flintlock
-rw-r--r-- 1 apache apache         12 Oct 12 10:11 iamflint
-rw-r--r-- 1 apache apache     103204 Oct 12 10:02 position.baseA
-rw-r--r-- 1 apache apache 6853787648 Oct 12 10:11 position.DB
-rw-r--r-- 1 apache apache     114206 Oct 12 10:02 postlist.baseA
-rw-r--r-- 1 apache apache     113034 Oct 12 09:48 postlist.baseB
-rw-r--r-- 1 apache apache 7483244544 Oct 12 10:02 postlist.DB
-rw-r--r-- 1 apache apache       2095 Oct 12 10:02 record.baseA
-rw-r--r-- 1 apache apache  138035200 Oct 12 10:11 record.DB
-rw-r--r-- 1 apache apache      46899 Oct 12 10:02 termlist.baseA
-rw-r--r-- 1 apache apache 3114835968 Oct 12 10:11 termlist.DB
-rw-r--r-- 1 apache apache      16588 Oct 12 10:02 value.baseA
-rw-r--r-- 1 apache apache 1102151680 Oct 12 10:11 value.DB    

This is for:
    number of documents = 6499500
    average document length = 202.329
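(For reference: the five .DB files above total roughly 18.7 GB, which over
6,499,500 documents works out to about 2.9 KB per document, consistent with
the ~3K/document figure above.)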
> Part of the database space will actually be unused (what xapian-compact
> does is to minimise this unused space).
>   
When running compact we save about 25%, so the estimate below (that
about 75% of the space is in use) seems right.
Here is a sample compaction (from an earlier stage of the data):
postlist: Reduced by 56.0786% 2661936K (4746792K -> 2084856K)
record: Reduced by 2.94789% 2272K (77072K -> 74800K)
termlist: Reduced by 5.68267% 97560K (1716800K -> 1619240K)
position: Reduced by 2.5952% 104056K (4009552K -> 3905496K)
value: Reduced by 50.6533% 300840K (593920K -> 293080K)
spelling: Size unchanged (0K)
synonym: Size unchanged (0K)             

Note, however, that such a compaction takes a lot of time and could
hardly be run daily; it would probably take more than a day once the
database is big (i.e. around 100M documents).
> The Btree implementation has two update modes - linear and non-linear.
> In linear mode, each block is filled as full as it will go before the
> next block is started, so blocks end up nearly 100% used (apart for the
> last block updated in a run of linear updates).  In non-linear mode,
> blocks fill until an update won't fit in the remaining space in the
> block where it needs to go, at which point that block is split into 2
> each ~50% full, so on average non-linear update results in ~75% use of
> blocks.  This is actually how a B-tree achieves its amortised update
> cost.
>
> If you're just appending documents with add_document(), then for most
> tables you'll only get linear update, and they end up pretty compact.
> The exception is the postlist table - the first flush will be entirely
> linear, and you may get some linear update during subsequent flushes
> but there's inevitably going to be non-linear update.  If the flush
> threshold is larger, you'll increase the scope for (and size of) spells
> of linear updating.
>   
Documents are actually added with replace_document_by_term, so that
unique external document IDs are preserved.
Previously we used replace_document and specified the docids ourselves,
but since merging doesn't preserve docids, we switched to the approach
above.
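For reference, the pattern is roughly the following (a minimal sketch
using the Python bindings, where the by-term form is the
replace_document(idterm, doc) overload; the "Q" prefix, paths and helper
name are just illustrative):

    import xapian

    def index_doc(db, ext_id, text):
        doc = xapian.Document()
        doc.set_data(text)

        term_gen = xapian.TermGenerator()
        term_gen.set_stemmer(xapian.Stem("english"))
        term_gen.set_document(doc)
        term_gen.index_text(text)

        # Unique ID term for this document, built from the external ID.
        id_term = "Q" + ext_id
        doc.add_term(id_term)

        # Replaces any existing document indexed by id_term, or adds a
        # new one if none exists, so external IDs stay stable across
        # re-indexing and merging.
        db.replace_document(id_term, doc)

    db = xapian.WritableDatabase("/path/to/db", xapian.DB_CREATE_OR_OPEN)
    index_doc(db, "12345", "some document text")
    db.flush()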
> There's certainly scope for improvement here though - for example, we
> currently store the document lengths in every posting list entry, but it
> would save quite a bit of space to store them just once, and it's likely
> to be faster overall when you consider the extra I/O and caching
> overhead.
>
> I'm also working on a replacement Btree manager which reduces the
> amount of disk space which the "wood" of the Btree takes.  That's
> particularly an issue for the positions table which contains a lot of
> items, many of which are very small.
>   
It would be interesting to test it and see the effect on data like ours.
>   
>> 2. Such an indexing process will start very fast, but when reaching a 
>> database size of 2M or 3M documents, each flush will take 2-4 minutes. 
>> This is already very slow for a flush of 10K small documents. Changing 
>> flushes to every 1K documents doesn't help.
>>     
>
> For greater efficiency, you'll want to make the flush threshold larger,
> not smaller.
>
>   
>> It seems the time a flush 
>> takes is not related so directly to the size of the flush itself but 
>> does strongly relate to the size of the database itself. How come? What 
>> happens during a flush?
>>     
>
> Most of the work is updating the inverted file.  This is done by making
> a pass in sorted order over the postlist table, applying changes.
> Because we make a pass over the table, this is much more efficient per
> document if you use a larger flush threshold (but not so large that the
> buffered changes are fighting with the OS cache of blocks from the
> Xapian database).
>
> Ideally this would self-tune, but at present if you want to maximise
> throughput you need to set XAPIAN_FLUSH_THRESHOLD, and experiment a bit
> with the value you set it to.
>   
We found that setting the flush threshold to a very high number and then
deciding when to flush in our own code works better for us.
This is mainly so we can flush not only after a certain number of
documents have been added, but also after a certain amount of time has
passed (so that documents don't wait too long before showing up in the
index).
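A rough sketch of what we do (Python bindings; the threshold value, the
timings, and the document_source()/index_doc() helpers are illustrative):

    import os
    import time
    import xapian

    # Make the built-in threshold so high that it effectively never
    # triggers; flushing is then decided entirely by our own code.
    # (We set it before opening the writable database.)
    os.environ["XAPIAN_FLUSH_THRESHOLD"] = "1000000"

    db = xapian.WritableDatabase("/path/to/db", xapian.DB_CREATE_OR_OPEN)

    MAX_PENDING = 100000   # flush after this many added documents...
    MAX_WAIT = 15 * 60     # ...or after this many seconds, whichever first

    pending = 0
    last_flush = time.time()

    for ext_id, text in document_source():   # hypothetical document feed
        index_doc(db, ext_id, text)          # as in the earlier sketch
        pending += 1
        if pending >= MAX_PENDING or time.time() - last_flush >= MAX_WAIT:
            db.flush()
            pending = 0
            last_flush = time.time()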
>> 3. If the time it takes to flush 10K documents when the DB is 3M 
>> documents, is 2.5 minutes, does it actually mean that when the database 
>> is 100M documents, each flush will take over an hour?
>>     
>
> It's not a linear relationship.
>   
While not exactly linear, we do see a constant rise... here is a sample
from the last 2 million documents' worth of flushes. The database is now
7.5 million documents.
Note that when the database was 3 million documents, a flush took around
130 seconds. So, so far it's relatively linear, and certainly growing all
the time.

(note: at this point the database is 5.4M docs)

    ...
    Flushing 100000 documents ...[Done: 294 sec]
    Flushing 100000 documents ...[Done: 296 sec]
    Flushing 100000 documents ...[Done: 241 sec]
    Flushing 100000 documents ...[Done: 329 sec]
    Flushing 100000 documents ...[Done: 354 sec]
    Flushing 100000 documents ...[Done: 317 sec]
    Flushing 100000 documents ...[Done: 322 sec]
    Flushing 100000 documents ...[Done: 309 sec]
    Flushing 100000 documents ...[Done: 344 sec]
    Flushing 100000 documents ...[Done: 337 sec]
    Flushing 100000 documents ...[Done: 335 sec]
    Flushing 100000 documents ...[Done: 318 sec]
    Flushing 100000 documents ...[Done: 346 sec]
    Flushing 100000 documents ...[Done: 343 sec]
    Flushing 100000 documents ...[Done: 341 sec]
    Flushing 100000 documents ...[Done: 355 sec]
    Flushing 100000 documents ...[Done: 364 sec]
    Flushing 100000 documents ...[Done: 369 sec]
    Flushing 100000 documents ...[Done: 384 sec]
    Flushing 100000 documents ...[Done: 378 sec] 
    ...

(note: at this point the database is 7.4M docs)

On the other hand, it seems that the actual indexing of documents has
been getting faster recently, maybe because the more data there is in
the database, the fewer new terms are encountered that need to be added
to the 'dictionary'?

>   
>> So.. thinking about how to maybe optimize this process, the concept of 
>> using a "live" small database for updates and then merge it into a big 
>> database comes to mind.
>>     
>
> That's certainly an approach that can work for some situations.
>
>   
>> However, there are two issues here:
>> 1. Such a merge is slow. It will take quite a lot of time (many hours) 
>> to compact/merge each such live database into a main one. If this 
>> process has to be done hourly, let's say, and the process takes more than
>> an hour, we are faced with a critical problem.
>> 2. Such a merge process seems to take quite a lot of resources from the 
>> system, limiting CPU, I/O and memory for the more urgent task.. indexing 
>> and searching.
>>     
>
> Note that if your system has a quiet time of day (as most do) you can
> perform the merge then, and still allow searching of new documents
> between merges by using Xapian's ability to search over multiple
> databases together.
>
>   
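For reference, searching the big database and the small "live" one
together would look roughly like this (a minimal sketch with the Python
bindings; paths and the query are illustrative):

    import xapian

    # Open the large, periodically-merged database and the small "live"
    # database, and search them as a single combined index.
    db = xapian.Database("/path/to/big_db")
    db.add_database(xapian.Database("/path/to/live_db"))

    enquire = xapian.Enquire(db)
    qp = xapian.QueryParser()
    qp.set_database(db)
    enquire.set_query(qp.parse_query("some search terms"))

    for m in enquire.get_mset(0, 10):
        print("%d: %s" % (m.rank, m.document.get_data()))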
>> 3. It also means that we can never use more than 50% of the disk space on
>> the server. In fact less than 40% or 45% to be safe. This is because the 
>> compact process is merging the big database with the small one, into a 
>> new database. So the new database will be bigger than the original big 
>> one. So just because of this process, the server disk space cannot be
>> utilized effectively.
>>     
>
> Disk is pretty cheap these days though (until you run out of drive
> bays!)  But you're right that this is a drawback of this approach,
> although it can be useful to have that sort of amount of spare space for
> other reasons - for example you're a bit stuck otherwise if you want to
> change how you index and need to rebuild the index from scratch while
> still allowing searches of the existing database while the rebuild
> happens.
>
> Cheers,
>     Olly
>   

