[Xapian-discuss] Search performance issues and profiling/debugging

Thu Oct 25 19:27:33 BST 2007

We ran the same test with databases 2,4,6,8 and 10 on a:
    Dell 2950,
        CentOS 5.0 64bit
        Two quad-Xeon CPUs (8 cores total)
        16GB mem
        6 disks in raid 5.
    (a nice and strong machine overall)

While speed was better in most cases, it was still long for repeated 
searches.. around 0.2-0.3 seconds for sequential searches (what was 
around 1 second on the other machine).

Notes:
    * There is mysql server running on that machine, but its not doing much.
    * Used the latest (as of today) Xapian for the SVN HEAD (on the 
other server it was a prior one - 1.0.3)
    * The run you see the output for, was not the first time the test 
was ran.. so data should be already cached.
    * Times were about 0.20 - 0.21 second (for 'no recip' search)
    * Estimates still vary.
    * Used latest compiled OProfile (got some warnings though when 
printing output with --callgraph=12 though - see output files)
Here are the debug info for 'no recip' search:

Initial test: 
http://www.pidgintech.com/other/fts/test/no_recip/test-initial.out
Summary:
    First search takes 5.36sec, 71st scan takes an 12.5sec (maybe a one 
time glitch). 4 other searches take about 1-2sec. Most other searches 
(94 of them) take between 0.2-0.3sec. Estimates vary in the area of 
7700-8300 results.

Second test: http://www.pidgintech.com/other/fts/test/ 
<http://www.pidgintech.com/other/fts/test/no_recip/test.out>no_recip/ 
<http://www.pidgintech.com/other/fts/test/no_recip/test.out>test.out 
<http://www.pidgintech.com/other/fts/test/no_recip/test.out> 
<http://www.pidgintech.com/other/fts/test/no_recip/test.out>
Summary:
    All searches took very close to 0.2sec. Estimates vary in the area 
of 7700-8300 results.

Oprofile output:
    Simple: http://www.pidgintech.com/other/fts/test/no_recip/oprofile.txt
    Detailed: 
http://www.pidgintech.com/other/fts/test/no_recip/oprofile-d.txt
    CallGraph=12: 
http://www.pidgintech.com/other/fts/test/no_recip/oprofile-callgraph.txt
    (Warnings seen during call to print CallGraph=12: 
http://www.pidgintech.com/other/fts/test/no_recip/oprofile-callgraph.warn)

We THEN ran a search on a different term 'gle suspension':
Results were a little disturbing, as estimates varied even though they 
were all below 100.. quite a low number to be inaccurate for estimates 
(not that its not disturbing enough that estimates vary on same query 
with same database open without database changes)..
However some notes:
    * The run you see the output for, was not the first time the test 
was ran.. so data should be already cached. However, the output was a 
bit after the original searches, so the first search took a bit more, 
for obvious reasons.
    * Times were about 0.015 second for all sequential searches (for 
'gle suspension' search). Difference could be related also to the 
estimates... around 65 compared to around 8000 for 'no recip'
    * Estimates still vary, and in the below 100 range. From 59 to 62.
    * We use a: Check at least of 100!

<http://www.pidgintech.com/other/fts/test/oprofile.txt>Test: 
http://www.pidgintech.com/other/fts/test/ 
<http://www.pidgintech.com/other/fts/test/gle/test.out>gle/ 
<http://www.pidgintech.com/other/fts/test/gle/test.out>test.out 
<http://www.pidgintech.com/other/fts/test/gle/test.out>
Summary:
    All searches took very close to 0.2sec. Estimates vary in the area 
of 7700-8300 results.

Oprofile output:
    Simple: http://www.pidgintech.com/other/fts/test/gle/oprofile.txt
    Detailed: http://www.pidgintech.com/other/fts/test/gle/oprofile-d.txt
    CallGraph=12: 
http://www.pidgintech.com/other/fts/test/gle/oprofile-callgraph.txt

Following this, we've run a test printing the docIDs to see why the 
variations.. Here are docIDs of the few variations..
[56297643;103475983;2238694;83603672;82696667;2041212;96212193;105779847;77279273;96496372;108118072;2403755;18899917;84542318;61840853;27258501;65134562;84167102;6266426;96834248;30386133;85952982;81182092;5224656;81838668;]
[2238694;83603672;56134222;82696667;2041212;105779847;96496372;108118072;2403755;18899917;27258501;65134562;84167102;6266426;85952982;81182092;5224656;2283529;10805044;83578837;9103611;81702717;99243007;55053292;103318422;]    

[56297643;103475983;83603672;82696667;2041212;96212193;105779847;77279273;96496372;108118072;18899917;84542318;61840853;27258501;65134562;84167102;6266426;96834248;30386133;85952982;81182092;5224656;81838668;9041043;83578837;] 

[56297643;103475983;2238694;83603672;82696667;2041212;96212193;105779847;77279273;96496372;108118072;18899917;84542318;61840853;27258501;65134562;84167102;6266426;96834248;30386133;85952982;81182092;5224656;81838668;2283529;]
...
the above are just 4 random samples..

Ok.. Now this looked a little disturbing..
Not only speed issue or estimates vary, search results vary!
Here is again, the list of the above 4 searches, side by side, ordered 
by DocID with [MISSING] where there is no result that should have been 
there according to other searches...

Search 1                Search 2               Search 3              
Search 4
----------              ----------              ----------            
-----------
2041212         #    2041212        #    2041212         #    2041212
2238694         #    2238694        #    [MISSING]    #    2238694
[MISSING]    #    2283529        #    [MISSING]    #    2283529
2403755         #    2403755        #    [MISSING]    #    [MISSING]
5224656         #    5224656        #    5224656         #    5224656
6266426         #    6266426        #    6266426         #    6266426
[MISSING]    #    [MISSING]   #    9041043          #    [MISSING]
[MISSING]    #    9103611        #    [MISSING]     #    [MISSING]
[MISSING]    #    10805044      #    [MISSING]     #    [MISSING]
18899917       #    18899917      #    18899917       #    18899917
27258501       #    27258501      #    27258501       #    27258501
30386133       #    [MISSING]   #    30386133       #    30386133
[MISSING]    #    55053292      #    [MISSING]    #    [MISSING]
[MISSING]    #    56134222      #    [MISSING]    #    [MISSING]
56297643       #    [MISSING]   #    56297643       #    56297643
61840853       #    [MISSING]   #    61840853       #    61840853
65134562       #    65134562      #    65134562       #    65134562
77279273       #    [MISSING]   #    77279273       #    77279273
81182092       #    81182092      #    81182092       #    81182092
[MISSING]    #    81702717      #    [MISSING]    #    [MISSING]
81838668       #    [MISSING]   #    81838668       #    81838668
82696667       #    82696667      #    82696667       #   82696667
[MISSING]    #    83578837      #    83578837       #     83603672
83603672       #    83603672      #    83603672       #    84167102
84167102       #    84167102      #    84167102       #    [MISSING]
84542318       #    85952982      #    84542318       #    84542318
85952982       #    [MISSING]   #    85952982       #    85952982
96212193       #    [MISSING]   #    96212193       #    96212193
96496372       #    96496372      #    96496372       #    96496372
96834248       #    99243007      #    96834248       #    96834248
103475983     #    103318422    #    103475983     #    103475983
105779847     #    105779847    #    105779847     #    105779847
108118072     #    108118072    #    108118072     #    108118072

Remember, this is only the first 25.
In fact, counting uniques, if the first 100 searches we see overall 50 
unique DocIDs of 25 asked-for results.

So we thought.. maybe its the ordering?
Then we tried to get rid of the weight scheme we set.

And here are the results:
Total unique results  in 100 searches for the top 25 results: 25 (as it 
should be).
Total search times: 0.013 for all.
Estimates: 62 for all.

We THEN ran it on the original 'no recip' search, and here are the results:
Total unique results in 100 searches for the top 25 results (as it 
should be).
Total search times: First: 3.5sec; Most others around 0.20-0.19 (Note 
though that while overly consistent, this is still about 15 times slower 
than the other search [gle suspension])
Estimates: 7571 for all.

By the way, this is the weight scheme we used.
my $k1 = 1;             #governs the importance of within document 
frequency. Must be >= 0. 0 means ignore wdf. Default is 1.
my $k2 = 25;            #compensation factor for the high wdf values in 
large documents. Must be >= 0. 0 means no compensation. Default is 0.
my $k3 = 1;             #governs the importance of within query 
frequency. Must be >= 0. 0 means ignore wqf. Default is 1.
my $b = 0.01;           #Relative importance of within document 
frequency and document length. Must be >= 0 and <= 1. Default is 0.5.
my $min_normlen = 0.5;  #specifies a cutoff on the minimum value that 
can be used for a normalised document length - smaller values will be 
forced up to this cutoff. This prevents very small documents getting a 
huge bonus weight. Default is 0.5.
$XAP_Enquire->set_weighting_scheme(Search::Xapian::BM25Weight->new($k1,$k2,$k3,$b,$min_normlen));           

Maybe its important to specify here that we index the title of the 
document with a higher weight (x5)

Then we went back to the old server.. Same speed as before (0.9-1.0sec 
per search) and this time estimates are stable. So, weight scheme is the 
cause of the inaccurate estimates. Why?

Overall this is very interesting..

1. Slow search tome for no-recip is still an issue. I am hoping the 
OProfile output will shed a light on this. Looking forward to hearing 
from the experts here..

2. I looked again into the docs about weight scheme and didn't see 
anything about changing it or about inaccuracy of results and estimates 
in such cases. Can anyone clarify?

3. Compared to the other server, 16GB mem over single SATA2 disks (five 
of them, each containing 2 databases),  a single quad CPU... would you 
expect such a big performance difference? of 1sec vs 0.2sec for repeated 
searches? Also, why is it still slow? (0.2sec for exactly the same 
search that was performed a second ago, when its all the server is doing 
and it should be cached.... sounds slow, isn't it?)

Anyway..
your thoughts?

Best regards,
Ron