[Xapian-discuss] b+tree to cdb
gervin23
gervin23 at fastmail.fm
Mon Sep 27 09:46:27 BST 2004
>>i've been performing some tests with xapian on a set of docs numbering
>>86801 with average length of 291.895. i've also removed stop words to
>>help with speed and size but i'm noticing some serious performance hits
>>when it comes to phrase searches, sometimes hitting +14s as compared to
>>~0.01s i'm getting with most boolean searches.
>
>
> Is this in real use, or when running test queries after building the
> database? One effect to be aware of is that queries will be a lot
> slower when nothing is cached. In real use, pretty much all the
> non-leaf blocks from the B-trees being used for the search will be
> cached pretty quickly, as well as many commonly used leaf blocks.
>
this was querying just after building a fresh index. yeah, i did notice
significant speedups the more i used it, though i wasn't sure what was
going on :) this explains it, thanks.
> Are the phrase queries that are slow just a phrase, or combined with
> other terms? For queries with a phrase and other terms, we could defer
> the phrase check in some cases to reduce how often we actually need to
> do it (there's a bugzilla entry for this).
>
> There's also a plan to tweak the indexing to eliminate some common cases
> which are currently phrases but don't really need to be (e.g. "e-mail").
>
these particular tests were pure phrases i.e. "president johnson".
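for reference, i'm constructing them roughly like this (a minimal C++
sketch; make_phrase is just a name i picked for illustration):

    #include <xapian.h>
    #include <string>
    #include <vector>

    // build a pure phrase query, e.g. for "president johnson"
    Xapian::Query make_phrase(const std::vector<std::string> & terms) {
        return Xapian::Query(Xapian::Query::OP_PHRASE,
                             terms.begin(), terms.end());
    }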
>>the following quote from the documentation has piqued my interest:
>>"implementing an ultra-compact read-only backend which would take a
>>quartz b-tree and convert it to something like a cdb hashed file" ...
>>"If you need serious performance, implementing such a backend is worth
>>considering.". firstly, would this help with phrase searches and
>>secondly, what are the details of implementing such a beast? i don't
>>need write access once indexed so this may be something i'd like to try.
>
>
> It would help everything a bit just by reducing the amount of I/O
> required. A large search system is going to become I/O bound
> eventually, and at that point you win by reducing the amount of I/O and
> improving the useful information in every block cached.
>
> But first, have you run the database through quartzcompact? This
> produces Btrees which are fully compact and at revision 1 - they're
> smaller and revision 1 means they're inherently a bit more efficient to
> access.
>
yep, i ran quartzcompact which does speed things up a bit but i never
did any pre-compact to post-compact timing tests.
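(for anyone trying this at home, the invocation was just:

    quartzcompact <source database dir> <destination database dir>

if i'm remembering the usage right.)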
> Also relevant are some plans I've been working on for revised encoding
> schemes to replace those currently used by quartz. Some tests I did
> showed the new positionlist encoding to be ~15% more compact than what
> we currently use, and we can save on the other tables too (more on
> some, less on others), which will all help.
>
awesome.
> Anyway, as to the question of producing a cdb backend - it isn't a
> huge undertaking, as a lot of quartz can be reused: keep the same
> encodings quartz uses, iterate through all the key/tag pairs in each
> Btree (the code in quartzcompact.cc shows how to do this), and put
> these into a cdb (or whatever format).
>
> Then a bit of glue code is needed to reimplement the bits of the Btree
> external API which are used for retrieval.
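just to check i follow, the conversion loop would be something like the
below? the cdb_make_* calls are from djb's cdb library; the Btree/Bcursor
names are my guess from a quick skim of quartzcompact.cc, so treat that
half as pseudo-code:

    extern "C" {
    #include <cdb_make.h>
    }

    // sketch: dump every key/tag pair from a quartz Btree into a cdb.
    void btree_to_cdb(Btree & in, int fd) {
        struct cdb_make cdbm;
        cdb_make_start(&cdbm, fd);   // fd must be open for writing

        Bcursor cursor(&in);
        cursor.find_entry("");       // position before the first key
        while (cursor.next()) {
            cursor.read_tag();       // fetch the tag for the current key
            cdb_make_add(&cdbm,
                         (char *)cursor.current_key.data(),
                         cursor.current_key.size(),
                         (char *)cursor.current_tag.data(),
                         cursor.current_tag.size());
        }
        cdb_make_finish(&cdbm);
    }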
>
> Looking at the cdb API, it's possible cdb isn't actually suitable as we
> need to be able to support a cursor which can move to the next and
> previous keys. The alternative would be to not chunk the posting lists,
> which is probably going to hurt performance. Or store all chunks for a
> term with the same key in the cdb (they allow multiple entries with the
> same key), but that may make skipping over chunks expensive as we'd need
> to read each chunk to check if it was the one we wanted...
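fwiw, on the reading side it looks like walking the duplicate entries for
one key goes through cdb_findstart()/cdb_findnext(), which would indeed
mean touching every chunk on the way. a rough sketch from the cdb docs
(for_each_chunk is my own name):

    extern "C" {
    #include <cdb.h>
    }

    // sketch: visit every entry stored under the same key; c must
    // already have been set up with cdb_init() on an open fd.
    void for_each_chunk(struct cdb * c, char * key, unsigned int keylen) {
        cdb_findstart(c);
        while (cdb_findnext(c, key, keylen) > 0) {
            unsigned int len = cdb_datalen(c);
            char * buf = new char[len];
            cdb_read(c, buf, len, cdb_datapos(c));
            // ... decode the chunk to see if it's the one we want ...
            delete[] buf;
        }
    }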
>
> Hmm, I also suspect the licence (which appears to be "You may distribute
> unmodified copies of the cdb package.") isn't GPL compatible. So a cdb
> backend would be fine for in-house use, but we couldn't distribute the
> cdb library itself with Xapian, and nobody could distribute cdb-enabled
> Xapian packages.
>
> Also cdb is limited to 4GB per table, which is going to be too
> restrictive for high end users.
>
> So "something like cdb" might need identifying first!
>
> I'm not sure quite what the saving would be. You could calculate the
> exact size of the cdb corresponding to a particular database - just
> iterate over the key/tag pairs, and total up the sizes of each, plus 24
> bytes per pair, plus 2048 bytes per table. But I've not tried this.
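so if i'm reading that right, per table it would be:

    expected cdb size = 2048 + sum over key/tag pairs of
                        (key length + tag length + 24)

might be a fun number to work out for my database.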
>
> I suspect quartzcompact plus the new encodings will make more difference
> than converting to cdb. But you could use the new encoding in a cdb
> too, and get the gains from both!
>
yeah, this sounds like a job for down the road :) i'm already very
impressed with xapian using my little dataset but was wondering what
kind of possibilities there might be with 10+ million docs when something
that big comes my way. certainly splitting the indexes up will help as
mentioned in the help docs.
>
>>one other question if i may. how would one get just the record numbers
>>for the entire estimated MSet? my html layout at the moment is two
>>frames, a table of contents on the left and search/document on the
>>right. in order to show (in the toc) which folder/subfolders/documents
>>have hits, i'm having to issue 'enquire.get_mset(0, 1000)' to shelve the
>>first 1000 record numbers, then issue another 'enquire.get_mset(0, 10)'
>>to display the first ten, twenty, etc... of course, the bottleneck is
>>the first get_mset so removing that step would help wonderfully.
>
>
> But get_mset only actually gets the document ids and weights - document
> data, etc. is lazily fetched, so if you don't ask for it, you don't incur
> any extra overhead.
>
> Getting 10 matches is faster not because we're reading less document
> data but because we're able to deploy our shortcutting optimisations
> more often and more effectively.
>
after issuing get_mset(0,10), are all the document ids thrown out (or
unavailable) after the first 10, or are they retrievable via the API? i
don't need data, values, etc., just the entire set of matching ids.
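concretely, what i'm doing now is the equivalent of this C++ (it's the
first, full-size get_mset i'd love to drop):

    // one big MSet just to collect every matching docid...
    std::vector<Xapian::docid> all_ids;
    Xapian::MSet big = enquire.get_mset(0, 1000);
    for (Xapian::MSetIterator i = big.begin(); i != big.end(); ++i)
        all_ids.push_back(*i);   // an MSetIterator dereferences to the docid

    // ...then a second, small MSet just for the ten shown.
    Xapian::MSet page = enquire.get_mset(0, 10);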
>
>>hardware-wise, i'm running a Pentium III (Katmai) @ 551MHz with 512MB
>>RAM and a 7200rpm drive (hdparm reports 39.45 MB/sec).
>
>
> How big are the database tables (i.e. what's the output of ls -l on the
> database directory)?
>
445MB after quartzcompact.
> Cheers,
> Olly
thanks much for all your help,
andrew