[Xapian-discuss] Returning "fresh" results only from multiple DBs

Olly Betts olly at survex.com
Thu Jan 15 13:05:50 GMT 2009


On Wed, Jan 14, 2009 at 10:02:37AM +0200, Henry wrote:
> How can I perform an enquiry, collapsing on a key (as currently done) to
> remove duplicate pages, but yielding the freshest of those duplicate pages?

Collapsing picks the highest ranked document, but the frequencies of
terms may increase or decrease and the document length may change when a
document is modified, so the older version may score more highly for
relevance for some queries.
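
A toy model of that behaviour, as a sketch only: the weights, keys and revision numbers are made-up illustration data, and `collapse` is a hypothetical helper rather than anything from the Xapian API.

```python
# Toy model of collapsing: keep only the best-ranked match per collapse key.
# The tuples below are invented for illustration, not real Xapian output.

def collapse(matches):
    """matches: iterable of (weight, collapse_key, revision) tuples."""
    best = {}
    for weight, key, revision in matches:
        # Only the relevance weight is compared; the revision is ignored.
        if key not in best or weight > best[key][0]:
            best[key] = (weight, key, revision)
    # Highest weight first, as in a relevance-ordered result list.
    return sorted(best.values(), reverse=True)


matches = [
    (2.7, "page-a", 1),  # old revision, uses the query term more often
    (1.9, "page-a", 2),  # fresh revision, scores lower for this query
    (2.2, "page-b", 1),
]
collapsed = collapse(matches)
```

Here the entry kept for "page-a" is revision 1, because nothing in the collapse step looks at freshness.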

You could sort by "revision" and then relevance, but then any revised
document would beat any other unrevised document (and any more revised
document would beat any less revised one).
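
To make the drawback concrete, here is a minimal sketch that orders invented (name, revision, weight) records by revision first and relevance second. The data is hypothetical and the sort is plain Python, not a Xapian call.

```python
# Sorting by revision, then relevance: the revision always dominates.
# Each record is (name, revision, relevance_weight); values are made up.
docs = [
    ("z", 1, 9.5),  # never revised, but by far the most relevant
    ("x", 3, 0.4),  # revised twice, barely relevant
    ("y", 2, 1.1),  # revised once
]

# Primary key: revision (descending); secondary key: weight (descending).
ranked = sorted(docs, key=lambda d: (d[1], d[2]), reverse=True)
order = [name for name, _, _ in ranked]
```

The most relevant document "z" ends up last purely because it was never revised.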

As a side issue, collapsing only chooses between documents which match
the query, but a term could be removed in a newer revision of a
document.  So even if the above were resolved somehow, you might still
get an old version of a document, because it is the newest version
which matches, even though the current version doesn't match at all.

I think the only way you could do this would be to keep a list of the
document ids of all replaced documents and combine this with every query
using AND_NOT and a PostingSource subclass (which requires SVN trunk
currently).
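
The set logic behind that suggestion can be sketched in plain Python. This is only a model under stated assumptions: `and_not` is a hypothetical helper and the document ids are invented. In Xapian itself the excluded side would be supplied by the PostingSource subclass rather than a Python set.

```python
# Model of "query AND_NOT replaced": drop every match whose document id
# is on the list of replaced documents. All ids below are made up.

def and_not(matching, excluded):
    """Return the docids in `matching` that are not in `excluded`, ascending."""
    excluded = set(excluded)
    return [docid for docid in sorted(matching) if docid not in excluded]


replaced = {3, 8}            # docids of superseded versions
matching = [8, 2, 3, 11]     # docids matching the user's query
fresh = and_not(matching, replaced)
```

The list of replaced ids has to be kept up to date whenever a newer version of a page is indexed, which is the bookkeeping cost of this approach.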

> I know we can perform updates on DB1, but I don't want to go down that
> path because of the volumes/sizes involved.

I think doing this at index time is the better way to go.
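
As a rough illustration of the index-time approach, the sketch below models an index as a dictionary keyed by a unique page identifier, so adding a newer version overwrites the older one. The helper `upsert` and the sample records are hypothetical. In Xapian the analogous operation would presumably be replacing a document by a unique term.

```python
# Model of index-time de-duplication: one slot per unique page key, so a
# newer version replaces the older one instead of sitting beside it.

def upsert(index, unique_key, doc):
    """Store `doc` under `unique_key`, replacing any earlier version."""
    index[unique_key] = doc


index = {}
upsert(index, "page-a", {"revision": 1, "text": "old text"})
upsert(index, "page-b", {"revision": 1, "text": "another page"})
upsert(index, "page-a", {"revision": 2, "text": "rewritten text"})
```

After these calls only two entries remain, and a search over them can never return the stale revision of "page-a".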

If the issue is that this is currently slow, there's probably scope for
improving it - for example:

http://trac.xapian.org/ticket/250

Cheers,
    Olly


