[Xapian-discuss] Indexing more than 15 billion documents

Olly Betts olly at survex.com
Mon Jun 18 04:11:54 BST 2007


On Sun, Jun 17, 2007 at 12:05:03PM +0100, Richard Boulton wrote:
> You should probably take a look at 
> http://www.xapian.org/docs/scalability.html
> 
> I've not recently worked on any dataset larger than the 500 million 
> document database mentioned in that document.
> 
> In particular, there is currently a limit of 4 billion documents in a 
> database, due to using a 32 bit type for document IDs, but I don't think 
> it would be particularly hard to change to using a 64 bit type here (the 
> database format might require an incompatible change, but this could be 
> managed).

The code was written with the intention that you should just be able to
increase the sizes of the types and recompile, but to my knowledge that
has never actually been tested.
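
For illustration, the change would presumably amount to something like
the following in the public type definitions.  This is only a hedged
sketch: the header location and typedef name are assumptions and would
need checking against the actual source tree.

    // Hypothetical edit to the Xapian type definitions (e.g. in a
    // header such as include/xapian/types.h): widen the document id
    // type from a 32 bit to a 64 bit unsigned integer and rebuild.
    namespace Xapian {
        // typedef unsigned docid;            // current 32 bit type
        typedef unsigned long long docid;     // hypothetical 64 bit type
    }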

Document ids are stored using a variable length encoding, so the database
format shouldn't be broken just by increasing the size of the type, though
it obviously would be once you actually use docids >= 2^32.  In any case,
compatibility is only an issue if you have existing databases, or need to
be able to exchange databases with others.
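
To see why widening the type doesn't by itself change what's on disk,
here's a minimal sketch of a 7-bits-per-byte variable length encoding.
It isn't necessarily the exact scheme Xapian uses, but it shows the
principle: the bytes produced depend only on the value, not on the width
of the C++ type holding it, so only values of 2^32 and above produce
sequences an unmodified 32 bit build couldn't handle.

    #include <string>

    // Sketch of a variable length unsigned integer encoding: 7 bits of
    // the value per byte, with the high bit set on all but the last
    // byte.  A value below 2^32 encodes to the same bytes whether the
    // argument type is 32 or 64 bits wide.
    std::string encode_varint(unsigned long long value) {
        std::string out;
        while (value >= 0x80) {
            out += char((value & 0x7f) | 0x80);
            value >>= 7;
        }
        out += char(value);
        return out;
    }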

> The other limits mentioned at the end of scalability.html are less 
> likely to be an issue, but given a reason and some time I'm sure we 
> could fix the B-tree table maximum size and the storage of the total 
> length, too.

We should probably consider what useful Btree block sizes are.  It's not
something I've benchmarked much, but I doubt that 2K is ever a good
choice, and I suspect 8K is probably too low in many cases on modern
hardware.  It might also be useful to be able to go beyond 64K blocks.
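
As a rough way to think about the trade-off, here's a back-of-the-envelope
sketch (the average entry size is an assumption, not a measurement) of how
block size affects Btree fan-out, and hence tree depth, for a document
count on the scale discussed in this thread:

    #include <cmath>
    #include <cstdio>

    // Back-of-the-envelope Btree fan-out: with an assumed average entry
    // size, larger blocks mean more entries per block and a shallower
    // tree, at the cost of more data read per block access.
    int main() {
        const double avg_entry_bytes = 24.0;   // assumed, not measured
        const double num_entries = 15e9;       // figure from this thread
        for (int block_kb : {2, 8, 64, 256}) {
            double fanout = (block_kb * 1024) / avg_entry_bytes;
            double depth = std::ceil(std::log(num_entries) / std::log(fanout));
            std::printf("%3dK blocks: ~%.0f entries/block, depth ~%.0f\n",
                        block_kb, fanout, depth);
        }
        return 0;
    }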

Cheers,
    Olly


