[Xapian-discuss] Matches estimate varies with sorting method

Olly Betts olly at survex.com
Wed Oct 17 15:34:23 BST 2007


On Wed, Oct 17, 2007 at 08:11:05PM +0800, Fabrice Colin wrote:
> On 10/17/07, Olly Betts <olly at survex.com> wrote:
> > You're likely to get a more accurate estimate when sorting since the
> > matcher generally has to consider more documents when sorting.
>
> That's fair enough. I am still surprised the figures are so wildly different.
> 
> [...]
>
> The query I am testing with is a range on another value. Does this matter ?

No, but it probably explains the poor estimates - we don't have a good
way to estimate how many times a ValueRangeProcessor will match, so
we have to set min to 0, max to db_size and we currently arbitrarily
estimate db_size / 2.
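
As a rough illustration of that fallback (a sketch only, not Xapian's
actual code; the function name opaque_filter_bounds is made up for this
example), the three figures for a filter of unknown selectivity could be
computed like this:

```python
def opaque_filter_bounds(db_size):
    """Return (lower, estimate, upper) for a filter whose selectivity is unknown.

    Hypothetical helper: with no information about the filter, anything
    from no documents to every document could match, so the estimate is
    simply the midpoint.
    """
    lower = 0
    upper = db_size
    estimate = db_size // 2
    return lower, estimate, upper


# Using the index size quoted later in this message (40848 documents):
print(opaque_filter_bounds(40848))  # prints (0, 20424, 40848)
```

The estimate and upper bound agree with the figures reported for the
date-sorted case; the reported lower bound is higher than 0, presumably
because the matcher had already seen that many matching documents.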

It would be good to improve this (and similarly improve MatchDecider
which has the same problem; phrase and near have a similar issue, and
currently we estimate how many documents contain all the terms, then multiply
by a rather arbitrary factor to guess how many might contain the terms
in the wanted order).

We could just read the first few values, which will allow you to reduce
max, or increase min, or both and also often get a better estimate
(though in many cases they will be atypical - for example, for a dated
search, the first few will have old dates as they were indexed long
ago).  Or we could probably adjust the estimates based on the documents
we test during the match.  Another option is to add a method to
ValueRangeProcessor which allows the subclass itself to estimate based
on the values of start and end, although I suspect that's hard to
implement well for most subclasses.
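
To make the second option more concrete, here is a minimal sketch of
adjusting the bounds from the documents tested during the match.  This is
an assumption about how it might work, not an existing Xapian feature:
refine_bounds and its parameters are hypothetical names, and the estimate
is a plain linear extrapolation.

```python
def refine_bounds(db_size, tested, matched):
    """Tighten (lower, estimate, upper) after testing some documents.

    db_size: total number of documents in the database
    tested:  how many documents the range check has been applied to
    matched: how many of those passed the range check
    """
    if not 0 <= matched <= tested <= db_size:
        raise ValueError("need 0 <= matched <= tested <= db_size")
    # Every document seen to match raises the lower bound.
    lower = matched
    # Every document seen to fail lowers the upper bound.
    upper = db_size - (tested - matched)
    if tested == 0:
        # Nothing known yet: fall back to the midpoint.
        estimate = db_size // 2
    else:
        # Assume the untested documents match at the observed rate.
        estimate = matched + round((db_size - tested) * matched / tested)
    return lower, estimate, upper


# Illustrative numbers only: 100 of 1000 documents tested, 25 matched.
print(refine_bounds(1000, 100, 25))  # prints (25, 250, 925)
```

The bounds are safe whichever documents were tested, but the extrapolated
estimate is only as good as the sample, so it suffers from the same
"atypical first few values" problem mentioned above.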

> When sorting by date, the estimate is 20424, the lower bound is 7735
> and the upper bound is 40848, which is the number of documents in my index.
> When sorting by relevance, all three figures are 100.

The numbers are at least consistent with the matcher thinking there are
exactly 100 matches then.

Cheers,
    Olly


