[Xapian-discuss] making my db leaner and meaner

Sun Mar 29 11:53:12 BST 2009

On Thu, Mar 26, 2009 at 04:30:09PM +0000, Ben Campbell wrote:
> I'm trying to shrink my xapian database in an effort to reduce load on 
> the poor server (I think it's just creeping up in size enough now to the 
> point where the machine is struggling with it a bit)
> 
> My indexing is pretty naive, and I've learnt a lot since I first began.
> I suspect there is a lot of fat that could be trimmed...

It's worth taking a look at the terms indexed for each document (the
delve tool in xapian-core/examples is good for this) and seeing if
you can get rid of any junk.  It depends on the nature of the data,
but things like ASCII art, OCRed documents, files with the wrong
extensions, etc. can result in terms which aren't useful for searches.

One simple thing to do is impose a sensible maximum term length, if you
aren't already or are only making sure terms don't exceed 245 bytes.
Xapian::TermGenerator uses 64 bytes (currently hardcoded).
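
As a rough sketch of that sort of cap (plain Python, not Xapian code; the
names keep_term and filter_terms are made up for illustration, and the
64 byte default just mirrors the TermGenerator figure above), you could
filter terms on their UTF-8 length before adding them to a document:

```python
# Hypothetical helpers, not part of Xapian.
MAX_TERM_BYTES = 64  # assumed cap, mirroring the TermGenerator figure


def keep_term(term, max_bytes=MAX_TERM_BYTES):
    """Return True if term is non-empty and its UTF-8 form fits in max_bytes."""
    return 0 < len(term.encode("utf-8")) <= max_bytes


def filter_terms(terms, max_bytes=MAX_TERM_BYTES):
    """Drop terms that are empty or too long, keeping the original order."""
    return [t for t in terms if keep_term(t, max_bytes)]
```

The check is on bytes rather than characters, so non-ASCII terms hit the
cap sooner than their character count suggests.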

> - reduce the number of values I use.
> Currently, I'm using 6 values - most of them are only used to store 
> things I want to display in my search results. These things I'll move 
> into a serialised form in the document data (which is currently unused).
> I only ever sort using one value (a datetime), so I'll ditch the other five.

Using values as a "database field" is a (sadly too popular) mistake.
With flint (the default backend in 1.0) you have to retrieve all the
values for a document to get any of them, so if you're sorting by
datetime, you've the overhead of the other five fields getting in
the way (this is worse if datetime isn't the lowest-numbered value).

Chert stores values separately in streams, which means you don't
get this "O(number of values)" effect, but fetching the values for
a document to display results means 5 Btree operations plus finding
the right value in each chunk, rather than one Btree operation to
get the document data, so it's still not great to use them this way.
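
A minimal sketch of packing the display-only fields into one string for
the document data, assuming JSON is an acceptable serialisation (the
field names and the two helper functions are invented for illustration).
The packed string is what you'd hand to Xapian::Document::set_data(),
and what you'd get back from get_data() when showing results:

```python
import json


def pack_display_fields(fields):
    """Serialise a dict of display-only fields to a compact JSON string."""
    return json.dumps(fields, separators=(",", ":"), sort_keys=True)


def unpack_display_fields(data):
    """Reverse pack_display_fields(); accepts str or UTF-8 encoded bytes."""
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    return json.loads(data)


# Hypothetical display fields for one document.
packed = pack_display_fields({"title": "A walk in the park", "author": "Ben"})
fields = unpack_display_fields(packed)
```

The datetime you sort by would stay in a value slot; only the fields
you merely display move into the document data.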

> - look at running xapian-compact from time to time
> I add about 2000 documents per day (and almost never remove documents).
> Not sure how much this would help, but you never know, and it's easy to 
> try it out.

It saves a bit, and the database structure right after compaction is
faster to search (or supposed to be - I don't think this has actually
been benchmarked for years now I come to think about it!)

But if you then continue adding more documents, the size will soon
grow again, so if there's a constant stream of documents without many
deletions, stopping to compact isn't all that worthwhile.

> Does this all sound sane? Anything obvious I've missed?
> I was toying with the idea of ditching the positional information on 
> terms, but that would prevent me doing queries like "a walk in the 
> park", right?

Yes.  It saves a lot of space, but you lose the ability to do phrase
searches and use NEAR.
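
To illustrate why, here's a toy inverted index in plain Python (a
deliberately simplified model, not how Xapian stores postings; all the
names are made up).  Without positions you can still find documents
containing all the words, but you can't tell whether they're adjacent:

```python
from collections import defaultdict


def build_index(docs, with_positions=True):
    """Map term -> {docid: [positions]}; the lists stay empty without positions."""
    index = defaultdict(dict)
    for docid, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            positions = index[term].setdefault(docid, [])
            if with_positions:
                positions.append(pos)
    return index


def match_all(index, terms):
    """Docids containing every term (no positional information needed)."""
    sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*sets) if sets else set()


def match_phrase(index, terms):
    """Docids where the terms occur consecutively; relies on positions."""
    hits = set()
    for docid in match_all(index, terms):
        starts = index[terms[0]][docid]
        if any(all(s + i in index[t][docid] for i, t in enumerate(terms))
               for s in starts):
            hits.add(docid)
    return hits


docs = {1: "a walk in the park", 2: "park the car and walk in"}
query = ["walk", "in", "the", "park"]
with_pos = build_index(docs, with_positions=True)
without_pos = build_index(docs, with_positions=False)
```

Here match_all() finds both documents with either index, but
match_phrase() can only pick out document 1 when positions were stored.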

> Any other ideas welcome :-)

There's an idea to make the record table optional for those who don't
want to delete or modify documents, or use relevance feedback.

Cheers,
    Olly