[Xapian-discuss] MatchSpy:ing on a large recordset

Olly Betts olly at survex.com
Sat May 31 13:29:14 BST 2008


On Thu, May 22, 2008 at 12:24:50PM -0700, alexander lind wrote:
> 
> On May 22, 2008, at 2:55 AM, Olly Betts wrote:
> 
> > On Wed, May 21, 2008 at 11:03:04PM -0700, alexander lind wrote:
> >> I have a project in the works that will have 10-15M records with a
> >> set of arbitrary attributes on each record.
> >>
> >> I need to build a system where a user can filter the recordset by
> >> selecting attribute values and/or negating on them, and for each
> >> attribute value given, the amount of matching records needs to be
> >> calculated in realtime - 1-2 seconds lookup time is acceptable.
> >
> > For the filtering options you're describing, making each attribute a
> > term prefix and filtering on those terms would be the most efficient
> > approach I think.
> 
> For attributes that can be applied as values, would it be faster to  
> put them in values instead?  Like for example the attribute age, which  
> could be a value between 1-100.

Any attribute which can be represented as prefixed terms could be put in
a value.  But (at least currently) terms are going to be faster for
simple checks (which seems to be what you're describing above).  A value
does better when you have a quantity with a lot of possible values and
want to perform a range search or calculate a geographical distance or
similar.  Something like age might be like that if you want to be able
to say "18 or older".

But that wasn't what I understood you to mean by "selecting attribute
values and/or negating on them".  If you want to be able to say "age ==
18" or "age != 18", then a term like XAGE18 will be best.  For small
ranges, you can OR values together.  It would probably be reasonable to
handle any age range like this (but it would also be reasonable to use
a value).  For quantities with millions of possible values, the OR
becomes too large to be sane, and a value will definitely be better.
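
A minimal sketch of building the term list for such a small range,
assuming the XAGE prefix from the example above.  The helper name
age_terms is made up for illustration, and the commented lines only
indicate roughly how the terms might be fed to the Xapian Python
bindings (base_query is a hypothetical name for the rest of the filter).

```python
def age_terms(low, high, prefix="XAGE"):
    """Return one prefixed term per integer age in [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return [f"{prefix}{age}" for age in range(low, high + 1)]


# "age == 18" needs a single term; "18 to 21" is a small OR of terms.
exact = age_terms(18, 18)
small_range = age_terms(18, 21)

# With the Xapian Python bindings installed, these could be used roughly as:
#   import xapian
#   in_range = xapian.Query(xapian.Query.OP_OR, small_range)
#   not_18 = xapian.Query(xapian.Query.OP_AND_NOT, base_query,
#                         xapian.Query("XAGE18"))
# where base_query (hypothetical) is whatever the rest of the filter selects.
```

The list grows by one term per distinct value in the range, which is
why this only stays sane for quantities with a modest number of values.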

An alternative is to add terms to represent larger granularities, like
Omega's old date range code which adds terms for whole years and months,
and then builds a range using the month and year terms to cover whole
months and years in the range.
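
The idea can be sketched as follows.  This is not Omega's actual code,
just an illustration of covering a date range with the coarsest terms
available; the Y, M and D prefixes and the function name
date_range_terms are assumptions, and each document would need to have
been indexed with its day, month and year terms for this to work.

```python
import calendar
from datetime import date, timedelta


def date_range_terms(start, end):
    """Cover [start, end] (inclusive) with year, month and day terms."""
    terms = []
    day = start
    while day <= end:
        last_of_year = date(day.year, 12, 31)
        last_of_month = date(day.year, day.month,
                             calendar.monthrange(day.year, day.month)[1])
        if day.month == 1 and day.day == 1 and last_of_year <= end:
            # The whole year lies inside the range: one term covers it.
            terms.append(f"Y{day.year:04d}")
            day = last_of_year + timedelta(days=1)
        elif day.day == 1 and last_of_month <= end:
            # The whole month lies inside the range.
            terms.append(f"M{day.year:04d}{day.month:02d}")
            day = last_of_month + timedelta(days=1)
        else:
            # Partial month at either edge: fall back to single days.
            terms.append(f"D{day.year:04d}{day.month:02d}{day.day:02d}")
            day += timedelta(days=1)
    return terms


terms = date_range_terms(date(2007, 12, 30), date(2009, 2, 2))
# terms == ["D20071230", "D20071231", "Y2008", "M200901",
#           "D20090201", "D20090202"]
```

ORing these six terms together covers a range of over 400 days, where
day terms alone would need one term per day.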

> >> Can this be achieved with Xapian and the MatchSpy functionality?
> >
> > You certainly could do it this way.
> 
> Do you think there is a better way to do it with Xapian?

Using prefixed terms, as I suggested above.

> >  If there's enough RAM to cache
> > all the value data, you'll probably at least be near the performance
> > target, but without trying it I couldn't say for sure.
> 
> Would it be of significant use if I had enough RAM to put the entire  
> xapian index in a RAM partition?

I'm not sure how OS virtual memory systems handle caching of RAM
partitions.  Ideally they wouldn't bother caching data read from files
in them, but that may not be easy to achieve, in which case you'll end
up with two copies of that data in RAM (and would need twice the size of
the database in RAM).

So you may be better off just ensuring that there is enough RAM to hold
the whole Xapian database, and letting the OS use it to cache it.

> > Using C++ here
> > is likely to help - calling from C++ to a scripting language and back
> > tens of millions of times will probably be a measurable overhead.
> 
> You mean when updating the recordset here, right?

No, at search time the MatchDecider needs to be called for each
potential match.  I've not profiled this overhead for different
languages, but it's likely to be a lot higher than the overhead of a C++
virtual method call.

Cheers,
    Olly


