[Xapian-discuss] Filter similar results

Robby Walker robby.walker at gmail.com
Tue Sep 12 20:35:42 BST 2006


Hi,

I'm trying to implement something like Google's "similar results
omitted" and I'm not sure how to go about it.

Specifically, I have a number of sets of documents, and I'd
like to get the best documents from *each* set.  So, making up some
relevance scores, say I have documents from set A with scores 10, 9, 8, 7
and from set B with scores 6, 5.  If I can only return 3 documents to
the user, I'd like to return 10, 9, and 6, i.e. I'd like to filter out
the similar documents 8 and 7 from A.

What's the best way to go about this?  I've come up with a few ideas:

One option is to do multiple smaller searches, one per set, but this
seems inefficient.  Is the cost of a search dominated by setting it up,
or does it depend more on the number of results requested?
What happens when I'm searching 20 or more different sets?
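
To make the first option concrete, here is a rough sketch in plain
Python.  It does not touch Xapian at all: INDEX, search_set and
best_from_each are hypothetical names, and the per-set limit of 2 is
my assumption, picked to match the made-up scores above.

```python
import heapq
from itertools import islice

# Hypothetical stand-in for an index: set name -> (score, doc_id) pairs,
# already sorted best-first.  A real version would run one query per set.
INDEX = {
    "A": [(10, "a1"), (9, "a2"), (8, "a3"), (7, "a4")],
    "B": [(6, "b1"), (5, "b2")],
}

def search_set(set_name, limit):
    """Pretend per-set search: return the top `limit` hits for one set."""
    return INDEX[set_name][:limit]

def best_from_each(set_names, per_set, total):
    """Run one small search per set, then merge the lists by descending score."""
    per_set_hits = [search_set(name, per_set) for name in set_names]
    # heapq.merge expects ascending keys, so negate the score.
    merged = heapq.merge(*per_set_hits, key=lambda hit: -hit[0])
    return list(islice(merged, total))

results = best_from_each(["A", "B"], per_set=2, total=3)
```

The merge itself is cheap; the open question is still how much each
extra per-set search costs.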

Another option would be to use a MatchDecider and only "OK" the first
few documents from each set.  Are documents tested by the MatchDecider
in any particular order, e.g. highest relevance first?
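
The reason the order matters can be shown with a small plain-Python
sketch.  FirstFewPerSet is a hypothetical stateful filter, not Xapian's
actual MatchDecider interface, and the (score, set) pairs are the
made-up numbers from above.

```python
from collections import Counter

class FirstFewPerSet:
    """Stateful filter: accept a document only while its set is under quota.

    Hypothetical stand-in for a decider-style callback; not Xapian's API.
    """

    def __init__(self, per_set):
        self.per_set = per_set
        self.seen = Counter()

    def __call__(self, doc):
        _score, set_name = doc
        if self.seen[set_name] >= self.per_set:
            return False
        self.seen[set_name] += 1
        return True

docs = [(10, "A"), (9, "A"), (8, "A"), (7, "A"), (6, "B"), (5, "B")]

# Documents offered best-first: the quota keeps the top two from each set.
best_first = list(filter(FirstFewPerSet(2), docs))

# Same documents offered worst-first: the quota keeps the wrong ones from A.
worst_first = list(filter(FirstFewPerSet(2), reversed(docs)))
```

So a counting decider only gives the right answer if documents reach it
in relevance order, which is exactly what I don't know.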

A third option is to do a standard relevance search over all sets.  If
one set dominates the results, I can search again with that set
filtered out and merge the results, repeating if necessary.  Again, the
question here is which part of a search is most expensive.
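
A rough plain-Python sketch of this third option follows.  CORPUS,
search and search_with_retries are hypothetical names, "dominates" is
taken to mean "has more than per_set hits in the result window", and the
exclude argument stands in for whatever filter the real query would use.

```python
from collections import Counter

# Hypothetical corpus of (score, set_name) pairs standing in for a real index.
CORPUS = [(10, "A"), (9, "A"), (8, "A"), (7, "A"), (6, "B"), (5, "B")]

def search(limit, exclude=frozenset()):
    """Pretend relevance search: top `limit` hits, skipping excluded sets."""
    hits = sorted((d for d in CORPUS if d[1] not in exclude), reverse=True)
    return hits[:limit]

def search_with_retries(total, per_set):
    """Search, cap any set over its quota, then search again without it."""
    capped = []       # best `per_set` hits from each set excluded so far
    excluded = set()
    while True:
        hits = search(total, exclude=excluded)
        counts = Counter(set_name for _, set_name in hits)
        dominant = {s for s, c in counts.items() if c > per_set}
        if not dominant:
            break     # no set exceeds its quota in this window
        for s in dominant:
            capped += [h for h in hits if h[1] == s][:per_set]
        excluded |= dominant
    # Merge the capped hits with the final window and keep the best `total`.
    return sorted(capped + hits, reverse=True)[:total]

results = search_with_retries(total=3, per_set=2)
```

Each round excludes at least one more set, so the loop runs at most once
per set plus one final search.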

Finally, the fourth option is to just return way more documents than I
need and then filter them manually.
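
A minimal plain-Python sketch of this last option, using the made-up
scores from above.  RANKED, search, cap_per_set and overfetch_and_filter
are hypothetical names, and the over-fetch factor of 2 is an arbitrary
assumption rather than a tuned value.

```python
from collections import Counter

# Hypothetical ranked index of (score, set_name) pairs, best first.
RANKED = [(10, "A"), (9, "A"), (8, "A"), (7, "A"), (6, "B"), (5, "B")]

def search(limit):
    """Pretend relevance search returning the top `limit` hits."""
    return RANKED[:limit]

def cap_per_set(hits, per_set, total):
    """Walk hits best-first, keeping at most `per_set` from any one set."""
    kept, counts = [], Counter()
    for score, set_name in hits:
        if counts[set_name] < per_set:
            counts[set_name] += 1
            kept.append((score, set_name))
            if len(kept) == total:
                break
    return kept

def overfetch_and_filter(total, per_set, factor=2):
    """Fetch more hits than needed, growing the request until enough survive.

    Assumes total >= 1 and factor >= 2 so the request size keeps growing.
    """
    limit = total * factor
    while True:
        hits = search(limit)
        kept = cap_per_set(hits, per_set, total)
        if len(kept) == total or len(hits) < limit:
            return kept  # enough results, or the index is exhausted
        limit *= factor

results = overfetch_and_filter(total=3, per_set=2)
```

The unknown here is how large the request has to grow when one set
dominates heavily, since every retry repeats the whole search.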

Do any of these seem good?  Or am I missing an easier way to do this?

Many thanks in advance!

- Robby


