Xapian 1.3.5 snapshot performance and index size

Jean-Francois Dockes jf at dockes.org
Sun Apr 10 15:47:01 BST 2016


Hi,

I ran some tests with Recoll to compare Xapian 1.2.22 and 1.3.5 performance.

I mostly used two relatively small document sets (realistic, typical
subsets of Recoll data).

The first set is a 2.2 GB mbox folder, with approximately 56K messages in
275 files, producing approximately 64K documents (because of attachments).

The second set is an 11 GB folder with 5300 PDF files in it (random PDFs
harvested on Google).

The machine has an Intel Core i7-4770T CPU @ 2.50GHz (4 cores +
hyperthreading), 8 GB of memory and SSD storage.

I repeated most tests multiple times, and I give the best times here (the
variation was not very significant anyway).

PDF directory:
-------------

Xapian 1.2.22
Index size 399 MB
time: real 3m15s user 22m19s sys 1m9s

Xapian 1.3.5
Index size 614 MB
time: real 3m18s user 22m21s sys 1m28s

Mail directory:
--------------

Xapian 1.2.22
Index size 615 MB
time: real 2m20s user 7m57s sys 1m34s

Approximately 2m of CPU time is spent in the actual Xapian thread (which
gets Xapian::Document objects as input and processes them into the index).

Xapian 1.3.5
Index size 794 MB
time: real 3m47s user 7m14s sys 1m59s

Approximately 2m40s of CPU time are spent in the Xapian thread.


Indexing performance, interpretation:
------------------------------------

On the PDF directory, the performance of the Xapian thread is masked by the
processing of the PDF input. The CPU utilization is good (CPU time/clock
time is around 7, compared to 8 possible threads).

On the mail directory, the input processing is less significant and the
single index update thread is the bottleneck, so the Xapian version makes a
difference: Xapian 1.3 is significantly slower.

The CPU utilization is less than with PDF input, because the process is
often waiting for the Xapian thread, which is almost never waiting for
input. The situation is worse with 1.3 than with 1.2, because 1.3 is slower.
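
As a rough check of the CPU time/clock time ratios discussed above, here is
a minimal Python sketch that recomputes them from the "time" figures quoted
earlier. The helper names (to_seconds, utilization) are only illustrative,
and the ratio is taken as (user + sys) / real.

```python
def to_seconds(t):
    """Convert a 'XmYs' string such as '3m15s' to seconds."""
    minutes, seconds = t.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def utilization(real, user, system):
    """CPU time (user + sys) divided by wall-clock time."""
    return (to_seconds(user) + to_seconds(system)) / to_seconds(real)


# (real, user, sys) copied from the indexing runs quoted above.
runs = {
    "PDF, 1.2.22": ("3m15s", "22m19s", "1m9s"),
    "PDF, 1.3.5": ("3m18s", "22m21s", "1m28s"),
    "mail, 1.2.22": ("2m20s", "7m57s", "1m34s"),
    "mail, 1.3.5": ("3m47s", "7m14s", "1m59s"),
}

ratios = {name: utilization(*times) for name, times in runs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

This gives about 7.2 for both PDF runs, and about 4.1 (1.2.22) versus 2.4
(1.3.5) for the mail runs.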

I am not sure why the gap between the CPU time of the Xapian thread and the
wall time is so much larger for 1.3, but one possible explanation would be
more I/O waits.

The increase in index size between 1.2.22 and 1.3.5 is quite significant
(about 54% for the PDF set and 29% for the mail set), concentrated on the
positions file.
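
For reference, the relative growth can be recomputed from the index sizes
quoted above with a few lines of Python. This is only a sketch of the
arithmetic; growth_percent is an illustrative name.

```python
# Index sizes in MB, copied from the figures above: (1.2.22, 1.3.5).
sizes_mb = {
    "PDF": (399, 614),
    "mail": (615, 794),
}


def growth_percent(old, new):
    """Relative size increase from old to new, in percent."""
    return (new - old) / old * 100.0


growth = {name: growth_percent(old, new) for name, (old, new) in sizes_mb.items()}
for name, pct in growth.items():
    print(f"{name}: +{pct:.0f}%")
```

The PDF index grows by about 54% and the mail index by about 29%.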


Phrase queries:
---------------

I ran a query on both versions of the mail index after copying the data to
a machine with spinning disks. The queries were run just after a reboot and
find 3 documents (output not shown):

Xapian 1.2
time recoll -t -q '"to be or not to be"'
 real   0m5.766s user   0m0.108s sys    0m0.600s

Xapian 1.3
time recoll -t -q '"to be or not to be"'
 real   0m2.178s user   0m0.072s sys    0m0.048s

This is a very significant improvement in phrase query time, which would,
I imagine, become even more spectacular on a really big index.
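
To make this kind of measurement repeatable, a small timing wrapper can
stand in for the shell's "time". The following is a minimal Python sketch,
not what was used for the numbers above: time_command is a hypothetical
helper, and it assumes a Unix-like system where os.times() reports CPU time
for child processes.

```python
import os
import subprocess
import time


def time_command(argv):
    """Run argv once and return (real, user, sys) in seconds."""
    cpu_before = os.times()
    start = time.perf_counter()
    # Discard the command output: only the timings are of interest here.
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    real = time.perf_counter() - start
    cpu_after = os.times()
    user = cpu_after.children_user - cpu_before.children_user
    system = cpu_after.children_system - cpu_before.children_system
    return real, user, system
```

For instance, time_command(["recoll", "-t", "-q", '"to be or not to be"'])
would time the same query as above. The result only reflects a cold cache
if the command is run right after a reboot, as was done here.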

Home directory
--------------

For another, more realistic data point, I used my whole home directory (on
SSD): 10 GB, 79K files yielding 112K documents.

I crashed the machine while trying to purge the cache for the query tests,
so the phrase queries are really cold :)

Xapian 1.2
  Index size: 1758228 KB
  Indexing time: real 10m29s user 38m40s sys 13m22s
  Cold phrase query:
    time recoll -t -q '"with a little help from my friends"'
    real 0m0.441s user 0m0.093s sys 0m0.028s

Xapian 1.3
  Index size: 2701448 KB
  Indexing time: real 15m2s user 36m53s sys 13m24s
  Cold phrase query:
    time recoll -t -q '"with a little help from my friends"'
    real 0m0.175s user 0m0.103s sys 0m0.019s

On SSD, phrase searches are also much faster with 1.3, but this would not
be significant in the personal use case (it might be a different matter on
a public site running myriads of queries, of course).
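
To put numbers on "much faster", here is a short Python sketch computing the
speedup factor and the absolute time saved from the "real" times quoted
above. The variable names are only illustrative.

```python
# Wall-clock ("real") times in seconds, copied from the runs above:
# (Xapian 1.2, Xapian 1.3).
real_times = {
    "mail index, spinning disk": (5.766, 2.178),
    "home index, SSD": (0.441, 0.175),
}

speedups = {name: old / new for name, (old, new) in real_times.items()}
saved = {name: old - new for name, (old, new) in real_times.items()}

for name in real_times:
    print(f"{name}: {speedups[name]:.1f}x faster, {saved[name]:.2f}s saved")
```

The factor is similar in both cases (about 2.6 and 2.5), but the absolute
gain is about 3.6 seconds on spinning disks against about a quarter of a
second on SSD.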


My conclusion at this point:
---------------------------

I think that most Recoll users will not notice the slightly slower
indexing.

Some might notice the 50% index size increase. Excessive index size is
already a relatively rare but recurring complaint. That is unless I did
something wrong: I am actually quite surprised by the increase.

Of course, having faster phrase searches is a good thing. Maybe I have not
run the right tests to display the maximum effect of the new code?

As it is, and still hoping that more 1.3 optimization will improve the
situation, I have to wonder if the price paid for faster phrase searches
is not a bit too high, given that these are rather infrequent queries, and
that the improvement, while very significant, does not completely solve the
issue.

jf


