[Xapian-discuss] merge speed and multiple local DBs design
Ron Kass
ron at pidgintech.com
Tue Oct 16 11:17:29 BST 2007
I have just finished a test of merging 3 files and measuring the speed
and efficiency of the operation, and here are the results:
-----
time xapian-compact /fts/FTS_1_part1 /fts/FTS_1_part2 /fts/FTS_1_part3 /fts/FTS_1_FINAL
postlist: Reduced by 49.6601% 18151496K (36551448K -> 18399952K)
record: INCREASED by 2.22356% 15392K (692224K -> 707616K)
termlist: Reduced by 3.10786% 483896K (15570064K -> 15086168K)
position: INCREASED by 2.56846% 880176K (34268672K -> 35148848K)
value: Reduced by 49.828% 2684256K (5387040K -> 2702784K)
spelling: Size unchanged (0K)
synonym: Size unchanged (0K)
real 312m13.114s
user 88m11.783s
sys 7m16.003s
-----
40G FTS_1_part1
26G FTS_1_part2
23G FTS_1_part3
69G FTS_1_FINAL
-----
Test machine specs:
CPU: quad-core, intel Xeon
HDD: 500GB SATA2 WD
Mem: 16GB 677MHz
-----
Total documents in the merged test database: 32 million
-----
Merging reduced the overall size of the databases from 89G to 69G (22.5%).
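As a cross-check on that figure, here is a minimal Python sketch that sums the per-table sizes reported by xapian-compact above (with the progress prefixes dropped). It assumes the "K" values are units of 1024 bytes, which the output itself does not state. The totals come out slightly below the rounded directory sizes, so the reduction computed this way is closer to 22% than 22.5%.

```python
import re

# Per-table lines copied from the xapian-compact output above.
COMPACT_OUTPUT = """\
postlist: Reduced by 49.6601% 18151496K (36551448K -> 18399952K)
record: INCREASED by 2.22356% 15392K (692224K -> 707616K)
termlist: Reduced by 3.10786% 483896K (15570064K -> 15086168K)
position: INCREASED by 2.56846% 880176K (34268672K -> 35148848K)
value: Reduced by 49.828% 2684256K (5387040K -> 2702784K)
"""

# Matches the "(beforeK -> afterK)" pair at the end of each table line.
SIZE_PAIR = re.compile(r"\((\d+)K -> (\d+)K\)")


def total_sizes_k(text):
    """Sum the before/after sizes (in K) over all table lines."""
    before = after = 0
    for line in text.splitlines():
        match = SIZE_PAIR.search(line)
        if match:
            before += int(match.group(1))
            after += int(match.group(2))
    return before, after


before_k, after_k = total_sizes_k(COMPACT_OUTPUT)
reduction_pct = 100.0 * (before_k - after_k) / before_k

# Assumption: 1K = 1024 bytes, so dividing by 1024**2 gives gigabytes.
print(f"before: {before_k / 1024**2:.1f}G, "
      f"after: {after_k / 1024**2:.1f}G, "
      f"reduced by {reduction_pct:.1f}%")
```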
Speed, though, is another matter: the process took a bit more than 5 hours
and used some CPU along the way. Not too much, but some. More importantly,
it consumed a fair amount of I/O.
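To put a rough number on the CPU versus I/O split, the sketch below converts the three figures from the `time` output above into seconds and computes what share of the wall-clock time was spent on the CPU. The helper name `to_seconds` is made up for this illustration. A low share is consistent with the merge mostly waiting on disk, although the `time` output alone does not prove that.

```python
def to_seconds(stamp):
    """Convert a `time`-style stamp such as '312m13.114s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Values copied from the `time` output of the merge run.
real = to_seconds("312m13.114s")
user = to_seconds("88m11.783s")
sys_time = to_seconds("7m16.003s")

# Fraction of wall-clock time during which the process was on a CPU.
cpu_share = (user + sys_time) / real
print(f"CPU time was {cpu_share:.1%} of wall-clock time")
```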
In this case, shrinking the DB is not a goal. If anything, the noted
fact that changes to a compacted database will be slower than changes to
the original uncompacted one (due to the decreased space reserved for
updates) is a drawback. But I assume for now that this speed
difference is not major (certainly not in the longer run?).
However, taking into consideration the load and resources used for the
merge, and the length of time it took, it appears to be an impractical
measure if we wanted to do it daily, and even more so as the database
grows.
The idea here was to use a daily DB for faster changes/indexing and then
to merge it every night into a bigger one that contains all the data.
I think it is safe to assume that the time it takes to merge the
databases is linearly related to the size of the databases, which in this
case was about 90GB for about 30M docs.
If that took 5 hours, merging 100M docs (about 300GB) will probably take
about 17 hours, which basically rules out daily merges.
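The estimate above can be written out as a small Python sketch. It relies on the same linear-scaling assumption, which has not been verified here, and on the guess of about 300GB for 100M docs. The function name `estimate_merge_hours` is made up for this illustration.

```python
def estimate_merge_hours(measured_hours, measured_gb, target_gb):
    """Scale a measured merge time linearly with total input size."""
    return measured_hours * target_gb / measured_gb


# Rounded figures used in the text: 5 hours for about 90GB.
rounded = estimate_merge_hours(5.0, 90.0, 300.0)

# Same estimate from the measured run: 312m13.114s for 89G of input.
measured_hours = (312 * 60 + 13.114) / 3600.0
measured = estimate_merge_hours(measured_hours, 89.0, 300.0)

print(f"rounded inputs:  {rounded:.1f} hours")
print(f"measured inputs: {measured:.1f} hours")
```

Either way the result lands between 16 and 18 hours, so the conclusion does not depend on the rounding.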
Also, keep in mind that we allocated all the CPU and I/O to the merge.
We would want to run regular indexing at the same time, search heavily
on that node, and allocate considerably fewer resources for 100M docs.
All of this suggests we are not going to be able to support the daily
merge model.
Any thoughts or suggestions about the above?
Is our test indicative of the merge operation? Are we doing something
wrong? Anything incorrect with my assumptions?
Best regards,
Ron