[Xapian-discuss] Theoretical question

Olly Betts olly at survex.com
Fri Feb 3 03:52:24 GMT 2006


On Mon, Jan 16, 2006 at 06:53:05PM -0600, Chris wrote:
> I've been reading the docs on the internal construction of Xapian. There's 
> discussion of autopruning and operator decay in the Matching section.
> 
> Elsewhere, though, it says that postings lists are stored in doc_id order, 
> instead of wdf order, which suggests that there could be high-ranking 
> documents at the end of a postings list.
> 
> How can autoprune and operator decay really have much effect, then? You 
> would almost always have to go to the end of every list.

The matcher will need to go to the end of the posting list for one term,
but at that point operator decay and autopruning can start to happen.

Also, for some operators (e.g. AND) the matcher can use skip_to on
the posting lists.  Often that avoids even needing to look at many
chunks of a long posting list.

We've not really looked in detail at how often the various optimisations
actually fire and how much work they save when they do.  But empirically
the combined effect is a marked speed-up.  Coincidentally I'm hoping to
collaborate on a paper studying this.

> Example: let's say we have 1000 documents, and we need to return the top 10 
> for a single-word query.

A single term query has no query operators, so operator decay can't
happen.

> On average, the top 10 will be scattered uniformly across a postings
> list which is sorted in doc_id order, which means that at least one of
> them will commonly be found 90% or 95% of the way into the list. 

Yes, in this case Xapian's matcher will simply scan the whole of that
term's posting list.  Unless you store the posting lists sorted into
decreasing wdf order, that's unavoidable.

Sorting is a possible alternative approach, though document length
factors into the calculation so you can't simply take the first 10
entries from the posting list for a single term query.  Also update is
harder if posting lists are sorted by wdf (delete in particular).  And
we can't compress using the very effective delta-docid scheme we
currently use.

A hybrid approach might work, storing "high" wdf postings in separate
lists to "low" wdf ones, perhaps with multiple bands.

But I'm resistant to making major changes like these without evidence
that they'll improve performance.  If you're interested in investigating
I'd suggest prototyping a scheme you think might work better and then
we can compare it with the current approach.

Cheers,
    Olly



More information about the Xapian-discuss mailing list