[Xapian-discuss] Unique Term Listings
Martin Hearn
martinhearn at mac.com
Fri Nov 17 17:39:25 GMT 2006
Thanks Olly, this is great, and does exactly what I want. I'll test
it over some large data sets over the weekend.
Am I right in thinking this won't work with the PHP5 bindings?
Cheers
Martin
On 15 Nov 2006, at 19:23, Olly Betts wrote:
> On Tue, Nov 14, 2006 at 11:19:35AM +0000, Martin Hearn wrote:
>> I'm trying to achieve something similar to what ebay does.
>> E.g. If I search for "harry potter" on ebay I get a list of over 50
>> categories with the number of entries for each category.
>
> An issue here is that you presumably want a full list of the
> categories
> with exact counts in each. If you want a full list of categories,
> then
> query expansion isn't going to work anyway - it's good for finding the
> best N categories.
>
> Xapian is built with the idea that we want to do as little work as
> possible while still returning the best N results. So it will try
> not to look consider every document. For example, the query "A OR B"
> can often be ended when we reach the last document indexed by one
> of the terms, if we can see that the maximum weight we could get
> from the other term is lower than the lowest weight in our candidate
> set.
>
> But to get a full list of categories and counts, you need to consider
> all matching documents. So while you can achieve this, you shouldn't
> be suprised if it's slower than just searching for 10 matches.
>
> The first thing you'll need to do is put the category (or category
> list
> if there are several) into a document value at index time.
>
> Then the trick is to use a MatchDecider to spy on the matcher (rather
> than to decide which matches are wanted - it always says "yes" for
> that). Our "spy" will simply check the value with the category in
> and update a std::map with counts (if categories are numeric, this
> could just be an array). If there are multiple categories, then
> just list them in the value (e.g. tab separated) and split and tally
> each in the spy.
>
> It's easier to code this than explain in words, so here's a patch
> against the "quest" example to show what I mean:
>
> http://www.oligarchy.co.uk/xapian/patches/match-spy-for-
> categories.patch
>
> When searching, we set check_at_least to some high value when calling
> get_mset so that the matcher will consider at least that many matching
> documents before giving up early. You could use (Xapian::doccount)-1
> here, but I'd suggest imposing a sane limit here since you probably
> don't really want to build an exact set of categories for more a
> million
> matches. So I impose a limit of a million and say "274+" instead of
> "274" next to category counts if we hit this limit.
>
> While testing this, I've noticed that check_at_least seems to actually
> return all the matches (which it shouldn't) so this technique may be
> faster than the patch currently suggests. I'll take a look and fix
> the
> check_at_least handling if it is misbehaving.
>
> Cheers,
> Olly
More information about the Xapian-discuss
mailing list