[Xapian-discuss] Xapian performance on gmane.org compared

Henry henka at cityweb.co.za
Fri Aug 28 09:32:07 BST 2009

Quoting "Olly Betts" <olly at survex.com>:
> As documented on http://search.gmane.org, it's chert.
>> What's the DB size on disk?
> 138GB.

That leaves me scratching my head: performing the same phrase search
should then be a lot quicker on my DB, which is only 4GB. I understand
that the number of hits will impact performance, but still...

>> How many search servers is gmane.org using?  Their approx. spec?
> One, which also handles indexing - see "rain" in the list here:
> http://gmane.org/host.php

Once again, a big head-scratcher: our machine is probably a few times
faster, and it is searching a test/sample DB which is 34x smaller.
Something doesn't add up.

> As Richard says, my patch in #394 should help, but note that you can
> tune the size of the "pond" by setting POND_SIZE in the environment.
> The default is 100000 which was sane for the situation I wrote it for,
> but higher or lower might be better (and I'd be interested to hear what
> works best for other situations so we can set it sanely automatically).
> There's no benefit in setting it higher than the number of documents
> matched by the AND query of the terms in the phrase.

Yes, I gave the patch a swing, and it halved the search time to ~15s -  
still confusing and terrible compared to the ~4s returned on 'rain'.

The number of docs matched in my query is only about 13k. Based on
your last comment, tweaking POND_SIZE will have no effect.
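
To spell out that reasoning: one reading of Olly's comment is that the
pond can never usefully hold more documents than the AND query matches,
so the effective size is the smaller of the two numbers. The function
name below is hypothetical and this is only a sketch of that reading,
not code from the patch in #394.

```python
def useful_pond_size(pond_size, and_matches):
    """Largest pond size that can still make a difference.

    Assumption (from the comment quoted above): setting POND_SIZE higher
    than the number of documents matched by the AND query gives no benefit.
    """
    return min(pond_size, and_matches)


# With ~13,000 AND matches, the default of 100000 and anything above it
# collapse to the same effective value.
print(useful_pond_size(100000, 13000))
print(useful_pond_size(200000, 13000))
```

Under that reading, only values below the match count could change the
behaviour for this query.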

Urgh!  I wish I knew what's going on.  As a final comment FYI:

All using patch from #394.

1xterm Phrase query Match size: ~3,000
POND_SIZE:  10,000:    2.50s
POND_SIZE:  25,000:    2.49s
POND_SIZE:  50,000:    2.49s
POND_SIZE: 100,000:    2.46s
POND_SIZE: 200,000:    2.48s

2xterm Phrase query Match size: ~47,000
POND_SIZE:  10,000:   22s
POND_SIZE:  25,000:   59s
POND_SIZE:  50,000:   21s
POND_SIZE: 100,000:   23s
POND_SIZE: 200,000:  197s

2xterm Phrase query Match size: ~1,700
POND_SIZE:  10,000:    6s
POND_SIZE:  25,000:   18s
POND_SIZE:  50,000:    7s
POND_SIZE: 100,000:    6s
POND_SIZE: 200,000:    6s

3xterm Phrase query Match size: ~13,500
POND_SIZE:  10,000:     8.4s
POND_SIZE:  25,000:     8.3s
POND_SIZE:  50,000:     8.3s
POND_SIZE: 100,000:     8.0s
POND_SIZE: 200,000:     8.3s

Looks like the existing default of 100,000 is indeed the sweet spot: it
is at or near the fastest time in every test.
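
For anyone wanting to repeat this kind of comparison, here is a minimal
sketch of a timing harness. It assumes only what is said above, namely
that POND_SIZE is read from the environment by a build with the patch
from #394. The command being timed is a hypothetical stand-in and should
be replaced with whatever actually runs the search. Taking the median of
a few runs is my own choice, since a single run per setting may be noisy.

```python
import os
import statistics
import subprocess
import sys
import time


def time_with_pond_size(cmd, pond_size, repeats=3):
    """Run cmd with POND_SIZE set; return the median wall-clock seconds."""
    # Copy the current environment and override only POND_SIZE.
    env = dict(os.environ, POND_SIZE=str(pond_size))
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command exits with a non-zero status.
        subprocess.run(cmd, env=env, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


if __name__ == "__main__":
    # Hypothetical stand-in: a child process that just echoes POND_SIZE.
    # Replace with the real search invocation.
    cmd = [sys.executable, "-c", "import os; print(os.environ['POND_SIZE'])"]
    for size in (10000, 25000, 50000, 100000, 200000):
        elapsed = time_with_pond_size(cmd, size)
        print(f"POND_SIZE: {size:>7,}: {elapsed:8.2f}s")
```

Running each setting more than once also helps separate the effect of
POND_SIZE from disk cache effects between runs.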
