[Xapian-discuss] Incremental indexing limitations
Ron Kass
ron at pidgintech.com
Fri Oct 12 13:34:37 BST 2007
Olly Betts wrote:
> On Thu, Oct 11, 2007 at 06:51:50PM +0200, Ron Kass wrote:
>
>> 1. Size of such a database will actually be about 3K per document. That
>> is bigger than the text of the documents themselves. This is quite
>> surprising really, as what Xapian is supposed to be storing is just the
>> term IDs and the positions. Any ideas why it's bigger than the original
>> size as opposed to, let's say, 1/3 of the size?
>>
>
> It's usually smaller, but it does depend on the nature of the source
> data. If you're indexing with positional information, I wouldn't expect
> to achieve 1/3 the size. It might be interesting to see the output of
> "ls -l" on the database directory.
>
-rw------- 1 apache apache 0 Oct 12 00:09 flintlock
-rw-r--r-- 1 apache apache 12 Oct 12 10:11 iamflint
-rw-r--r-- 1 apache apache 103204 Oct 12 10:02 position.baseA
-rw-r--r-- 1 apache apache 6853787648 Oct 12 10:11 position.DB
-rw-r--r-- 1 apache apache 114206 Oct 12 10:02 postlist.baseA
-rw-r--r-- 1 apache apache 113034 Oct 12 09:48 postlist.baseB
-rw-r--r-- 1 apache apache 7483244544 Oct 12 10:02 postlist.DB
-rw-r--r-- 1 apache apache 2095 Oct 12 10:02 record.baseA
-rw-r--r-- 1 apache apache 138035200 Oct 12 10:11 record.DB
-rw-r--r-- 1 apache apache 46899 Oct 12 10:02 termlist.baseA
-rw-r--r-- 1 apache apache 3114835968 Oct 12 10:11 termlist.DB
-rw-r--r-- 1 apache apache 16588 Oct 12 10:02 value.baseA
-rw-r--r-- 1 apache apache 1102151680 Oct 12 10:11 value.DB
This is for:
number of documents = 6499500
average document length = 202.329
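For reference, the table files above sum to roughly 18.7 GB, which over
6,499,500 documents works out to a little under 3K per document, in line
with the ~3K figure from my original mail. The two figures above come
from the database itself; a minimal sketch of reading them with the
Python bindings (the path is just a placeholder):

  import xapian

  db = xapian.Database("/data/xapian/main")   # placeholder path
  print "documents: ", db.get_doccount()      # number of indexed documents
  print "avg length:", db.get_avlength()      # average document length (terms)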
> Part of the database space will actually be unused (what xapian-compact
> does is to minimise this unused space).
>
When running compact we save about 25%, so the estimate below is right
(about 75% of the space is used).
Here is a sample compact run (from an earlier stage of the data):
postlist: Reduced by 56.0786% 2661936K (4746792K -> 2084856K)
record: Reduced by 2.94789% 2272K (77072K -> 74800K)
termlist: Reduced by 5.68267% 97560K (1716800K -> 1619240K)
position: Reduced by 2.5952% 104056K (4009552K -> 3905496K)
value: Reduced by 50.6533% 300840K (593920K -> 293080K)
spelling: Size unchanged (0K)
synonym: Size unchanged (0K)
Note, however, that such a compact takes a lot of time and could hardly
be run daily, as it would probably take more than a day to run once the
database is big (i.e. when it is 100M documents long).
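For anyone following along, the output above comes from the stand-alone
xapian-compact tool; the invocation is essentially just source and
destination directories, something like this (paths are placeholders):

  xapian-compact /data/xapian/main /data/xapian/main.compact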
> The Btree implementation has two update modes - linear and non-linear.
> In linear mode, each block is filled as full as it will go before the
> next block is started, so blocks end up nearly 100% used (apart for the
> last block updated in a run of linear updates). In non-linear mode,
> blocks fill until an update won't fit in the remaining space in the
> block where it needs to go, at which point that block is split into 2
> each ~50% full, so on average non-linear update results in ~75% use of
> blocks. This is actually how a B-tree achieves its amortised update
> cost.
>
> If you're just appending documents with add_document(), then for most
> tables you'll only get linear update, and they end up pretty compact.
> The exception is the postlist table - the first flush will be entirely
> linear, and you may get some linear update during subsequent flushes
> but there's inevitably going to be non-linear update. If the flush
> threshold is larger, you'll increase the scope for (and size of) spells
> of linear updating.
>
Documents are actually added with replace_document_by_term. This is done
so that unique external document IDs are preserved.
Before, we used replace_document and specified the docids ourselves, but
since the merge doesn't preserve docids, we shifted to the approach
stated above.
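To make that concrete, here is a rough sketch of the indexing step in
the Python bindings, where replace_document_by_term corresponds to the
unique-term overload of replace_document; the "Q" ID prefix, the path
and the field handling are just illustrative:

  import xapian

  db = xapian.WritableDatabase("/data/xapian/main", xapian.DB_CREATE_OR_OPEN)

  def index_record(external_id, text):
      doc = xapian.Document()
      doc.set_data(text)                  # stored record data
      pos = 1
      for term in text.lower().split():
          doc.add_posting(term, pos)      # terms with positional information
          pos += 1
      idterm = "Q" + external_id          # unique external-ID term
      doc.add_term(idterm)
      # Replaces any existing document carrying this ID term, or adds a
      # new one, so the external ID (not the Xapian docid) stays stable.
      db.replace_document(idterm, doc)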
> There's certainly scope for improvement here though - for example, we
> currently store the document lengths in every posting list entry, but it
> would save quite a bit of space to store them just once, and it's likely
> to be faster overall when you consider the extra I/O and caching
> overhead.
>
> I'm also working on a replacement Btree manager which reduces the
> amount of disk space which the "wood" of the Btree takes. That's
> particularly an issue for the positions table which contains a lot of
> items, many of which are very small.
>
It would be interesting to test it and see the effect on data like ours.
>
>> 2. Such an indexing process will start very fast, but when reaching a
>> database size of 2M or 3M documents, each flush will take 2-4 minutes.
>> This is already very slow for a flush of 10K small documents. Changing
>> flushes to every 1K documents doesn't help.
>>
>
> For greater efficiency, you'll want to make the flush threshold larger,
> not smaller.
>
>
>> It seems the time a flush
>> takes is not so directly related to the size of the flush itself as
>> it is to the size of the database. How come? What
>> happens during a flush?
>>
>
> Most of the work is updating the inverted file. This is done by making
> a pass in sorted order over the postlist table, applying changes.
> Because we make a pass over the table, this is much more efficient per
> document if you use a larger flush threshold (but not so large that the
> buffered changes are fighting with the OS cache of blocks from the
> Xapian database).
>
> Ideally this would self-tune, but at present if you want to maximise
> throughput you need to set XAPIAN_FLUSH_THRESHOLD, and experiment a bit
> with the value you set it to.
>
We found that setting the auto threshold to a very high number and then
deciding on the flush timing in our own code works somewhat better.
This is mainly so we can flush not only after a certain number of
documents have been added, but also after a specific amount of time has
passed (so that documents don't wait too long to show up in the index).
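A minimal sketch of that arrangement, assuming the Python bindings; the
limits, the path and the fetch_new_records()/make_document() helpers are
made up for illustration:

  import os, time, xapian

  # Make the built-in threshold effectively irrelevant so flushes only
  # happen when we ask for them; set before opening the database.
  os.environ["XAPIAN_FLUSH_THRESHOLD"] = "1000000"

  db = xapian.WritableDatabase("/data/xapian/main", xapian.DB_CREATE_OR_OPEN)

  MAX_PENDING = 100000      # flush after this many added documents...
  MAX_WAIT = 15 * 60        # ...or after this many seconds, whichever first
  pending = 0
  last_flush = time.time()

  for external_id, text in fetch_new_records():
      db.replace_document("Q" + external_id, make_document(text))
      pending += 1
      if pending >= MAX_PENDING or time.time() - last_flush >= MAX_WAIT:
          db.flush()        # commit the pending batch to disk
          pending = 0
          last_flush = time.time()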
>> 3. If the time it takes to flush 10K documents when the DB is 3M
>> documents is 2.5 minutes, does it actually mean that when the database
>> is 100M documents, each flush will take over an hour?
>>
>
> It's not a linear relationship.
>
While not exactly linear, we do see a constant rise. Here is a sample of
the flushes over the last 2 million documents; the database is now 7.5
million documents.
Note that when the database was 3 million documents, a flush took around
130 seconds. So, so far, it is relatively linear, and certainly growing
all the time.
(note: at this point the database is 5.4M docs)
...
Flushing 100000 documents ...[Done: 294 sec]
Flushing 100000 documents ...[Done: 296 sec]
Flushing 100000 documents ...[Done: 241 sec]
Flushing 100000 documents ...[Done: 329 sec]
Flushing 100000 documents ...[Done: 354 sec]
Flushing 100000 documents ...[Done: 317 sec]
Flushing 100000 documents ...[Done: 322 sec]
Flushing 100000 documents ...[Done: 309 sec]
Flushing 100000 documents ...[Done: 344 sec]
Flushing 100000 documents ...[Done: 337 sec]
Flushing 100000 documents ...[Done: 335 sec]
Flushing 100000 documents ...[Done: 318 sec]
Flushing 100000 documents ...[Done: 346 sec]
Flushing 100000 documents ...[Done: 343 sec]
Flushing 100000 documents ...[Done: 341 sec]
Flushing 100000 documents ...[Done: 355 sec]
Flushing 100000 documents ...[Done: 364 sec]
Flushing 100000 documents ...[Done: 369 sec]
Flushing 100000 documents ...[Done: 384 sec]
Flushing 100000 documents ...[Done: 378 sec]
...
(note: at this point the database is 7.4M docs)
On the other hand, it seems that the actual indexing of documents has
been getting faster recently, maybe because the more data there is in
the database, the fewer new terms appear and need to be added to the
'dictionary'?
>
>> So, thinking about how to maybe optimize this process, the concept of
>> using a "live" small database for updates and then merging it into a
>> big database comes to mind.
>>
>
> That's certainly an approach that can work for some situations.
>
>
>> However, there are two issues here:
>> 1. Such a merge is slow. It will take quite a lot of time (many hours)
>> to compact/merge each such live database into a main one. If this
>> process has to be done hourly, let's say, and the process takes more than
>> an hour, we are faced with a critical problem.
>> 2. Such a merge process seems to take quite a lot of resources from the
>> system, limiting CPU, I/O and memory for the more urgent tasks: indexing
>> and searching.
>>
>
> Note that if your system has a quiet time of day (as most do) you can
> perform the merge then, and still allow searching of new documents
> between merges by using Xapian's ability to search over multiple
> databases together.
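For reference, searching the main database and the small "live" database
together is just a matter of stacking them; a minimal sketch with the
Python bindings (paths and the query are placeholders):

  import xapian

  db = xapian.Database("/data/xapian/main")              # big, merged database
  db.add_database(xapian.Database("/data/xapian/live"))  # recent documents

  enquire = xapian.Enquire(db)
  enquire.set_query(xapian.Query("example"))             # placeholder query
  mset = enquire.get_mset(0, 10)
  print "estimated matches:", mset.get_matches_estimated()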
>
>
>> 3. It also means that we can never use more than 50% of the disk space
>> on the server, in fact less than 40% or 45% to be safe. This is because
>> the compact process merges the big database with the small one into a
>> new database, so the new database will be bigger than the original big
>> one. So just because of this process, the server's disk space cannot be
>> utilized effectively.
>>
>
> Disk is pretty cheap these days though (until you run out of drive
> bays!) But you're right that this is a drawback of this approach,
> although it can be useful to have that sort of amount of spare space for
> other reasons - for example you're a bit stuck otherwise if you want to
> change how you index and need to rebuild the index from scratch while
> still allowing searches of the existing database while the rebuild
> happens.
>
> Cheers,
> Olly
>