[Xapian-discuss] xapian performance

Wed Nov 22 23:31:35 GMT 2006

On Wed, Nov 22, 2006 at 06:55:21PM -0200, Fernando Nemec wrote:
> Do you think it's better to have a larger set of queries, or will this
> do fine?

The effects will depend on the queries, but Arjen has already tested a
larger set so I was mostly hoping you could confirm there was no
regression for the two term case.

> This was run *without* the experimental phrase optimization patch:
> 
> <!--Xapian::Query(lula)-->
> 0m0.412s
> <!--Xapian::Query((presidente PHRASE 2 lula))-->
> 1m5.062s
> <!--Xapian::Query((governo PHRASE 6 do PHRASE 6 estado PHRASE 6 de PHRASE 6 sao PHRASE 6 paulo))-->
> 1m14.193s
> 
> This was run *with* the phrase optimization patch:
> 
> <!--Xapian::Query(lula)-->
> 0m0.379s
> <!--Xapian::Query((presidente PHRASE 2 lula))-->
> 0m58.514s
> <!--Xapian::Query((governo PHRASE 6 do PHRASE 6 estado PHRASE 6 de PHRASE 6 sao PHRASE 6 paulo))-->
> 1m2.503s

It's interesting that the first case is sped up (by 8%, which is a
little high to be noise) - the patch shouldn't change non-phrase queries
at all.  Is this SVN HEAD with and without this patch?

http://www.oligarchy.co.uk/xapian/patches/xapian-experimental-phrase-optimisation-v2.patch
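
For reference, the relative changes can be worked out directly from the
figures quoted above.  This is only a minimal sketch in Python; the
helper names (parse_time, speedup_percent) are made up for illustration
and are not part of Xapian or Omega.

```python
import re

def parse_time(text):
    """Convert a `time`-style string such as '1m5.062s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

def speedup_percent(before, after):
    """Percentage reduction in run time going from `before` to `after`."""
    b, a = parse_time(before), parse_time(after)
    return 100.0 * (b - a) / b

# (query, without patch, with patch), copied from the quoted timings.
TIMINGS = [
    ("lula", "0m0.412s", "0m0.379s"),
    ("presidente PHRASE 2 lula", "1m5.062s", "0m58.514s"),
    ("governo PHRASE 6 do PHRASE 6 estado PHRASE 6 de PHRASE 6 sao "
     "PHRASE 6 paulo", "1m14.193s", "1m2.503s"),
]

for query, before, after in TIMINGS:
    print(f"{query}: {speedup_percent(before, after):.1f}% faster")
```

These are single runs, so the percentages say nothing about run-to-run
variation.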

Are you timing Omega?  If so, did you try removing $topterms from your
query template?

And how are you timing?

If this is "wall-clock" time from the "time" utility/built-in, what are
the user and system times?
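
The distinction matters because wall-clock time that is much larger than
user + system time usually means the process spent most of its life
waiting, typically on disk I/O.  If it's easier than reading the output
of "time", something along these lines would capture all three numbers
on a Unix system.  It's a sketch only: time_command is a made-up helper,
and the command it runs here is a placeholder rather than a real search.

```python
import resource
import subprocess
import sys
import time

def time_command(argv):
    """Run argv and return (wall, user, system) seconds for the child."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    wall = time.perf_counter() - start
    # RUSAGE_CHILDREN covers children that have been waited for, which
    # subprocess.run() has done by the time it returns.
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return wall, user, system

# Placeholder command; substitute whatever actually runs the search.
wall, user, system = time_command([sys.executable, "-c", "pass"])
print(f"real {wall:.3f}s  user {user:.3f}s  sys {system:.3f}s")
print(f"not on CPU (waiting): {max(wall - user - system, 0.0):.3f}s")
```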

> I don't know if this is relevant, but maybe it is. On this query
> 
> <!--Xapian::Query((presidente PHRASE 2 lula))-->
> 
> the cache seems not to affect this query at all. Even if I run the
> exact same query seconds later, the search time is high and almost the
> same.

I think this must mean that we need to read so many disk blocks for
this query that not many end up cached.  I think you said you had 1GB
of RAM, so there might not be all that much left for caching.  What
does the "free" command report?
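
On Linux the same numbers can be read from /proc/meminfo, which is handy
for logging them alongside each timing.  A rough sketch, assuming a
Linux box; parse_meminfo and cache_summary are illustrative names only.

```python
import os

def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: numeric value}.

    Memory fields are reported in kB.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def cache_summary(fields):
    """Return (total_kb, free_kb, cache_kb); missing fields count as 0."""
    cache = fields.get("Buffers", 0) + fields.get("Cached", 0)
    return fields.get("MemTotal", 0), fields.get("MemFree", 0), cache

# Only attempt the read where the file exists (i.e. on Linux).
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        total, free, cache = cache_summary(parse_meminfo(f.read()))
    print(f"total {total} kB, free {free} kB, buffers+cached {cache} kB")
```

The "buffers+cached" figure is roughly what is currently being used as
disk cache.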

> If there's anything else I can do to help to fix this issue, please
> let me know.

It would be interesting to try measuring just how many blocks we
actually read - this will be a repeatable measure, whereas timings
from cold disk cache are much harder to exactly repeat.  Try applying
this patch:

http://www.oligarchy.co.uk/xapian/patches/flint-count-read-blocks.patch

This reports the number of blocks read from each table of each flint
database to stderr (the report happens whenever a database is closed).
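
To illustrate the idea (this is not the patch itself, which is C++
inside the flint backend), here is a toy Python sketch of counting block
reads and reporting on close.  CountingBlockReader is hypothetical, and
the 8192-byte block size is an assumption for the example.

```python
import os
import sys
import tempfile

class CountingBlockReader:
    """Read fixed-size blocks from a file and count how many were read."""

    def __init__(self, path, block_size=8192):
        self.path = path
        self.block_size = block_size
        self.blocks_read = 0
        self._file = open(path, "rb")

    def read_block(self, number):
        """Return block `number` (zero-based) and bump the counter."""
        self._file.seek(number * self.block_size)
        data = self._file.read(self.block_size)
        self.blocks_read += 1
        return data

    def close(self):
        """Close the file and report the total to stderr."""
        self._file.close()
        print(f"{self.path}: {self.blocks_read} blocks read",
              file=sys.stderr)

# Demo on a throwaway file of three blocks.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (3 * 8192))
reader = CountingBlockReader(tmp.name)
reader.read_block(0)
reader.read_block(2)
reader.close()
os.remove(tmp.name)
```

A counter like this counts read requests, so it gives the same number
whether the OS serves them from cache or from disk, which is what makes
it repeatable.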

Cheers,
    Olly