[Xapian-discuss] Indexing more than 15 billion documents
Richard Boulton
richard at lemurconsulting.com
Sun Jun 17 12:05:03 BST 2007
Ycrux wrote:
> Hi there!
>
> My company has more than 15 billion web documents.
> We're archiving all the web (like Internet Archive).
> Last spring, we used Lucene to provide full-text search capability,
> but we hit a limit at 500 million documents.
Out of interest, was that a hard limit (ie, the system doesn't allow
more documents than that) or a soft performance limit (ie, performance
became unacceptably slow at that size)?
> We're now considering XAPIAN. Could anybody share experiences
> in this kind of huge dataset?
You should probably take a look at
http://www.xapian.org/docs/scalability.html
I've not recently worked on any dataset larger than the 500 million
document database mentioned in that document.
In particular, there is currently a limit of 4 billion documents in a
database, due to the use of a 32-bit type for document IDs, but I don't
think it would be particularly hard to switch to a 64-bit type here (the
database format might require an incompatible change, but this could be
managed).  I, and I'm sure the other Xapian developers, would be happy
to work with you on fixing that.
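As a back-of-the-envelope sketch of what that 32-bit limit means in
practice (my arithmetic, not from the original post; it assumes docid 0
is reserved as the invalid ID, so a single database tops out at 2**32 - 1
documents, and that a larger collection would be split across several
databases searched together):

```python
# Rough capacity arithmetic for a 15-billion-document collection,
# assuming 32-bit document IDs with docid 0 reserved as "invalid".
import math

DOCID_BITS = 32
MAX_DOCS_PER_DB = 2**DOCID_BITS - 1   # largest valid docid in one database
TOTAL_DOCS = 15_000_000_000

shards = math.ceil(TOTAL_DOCS / MAX_DOCS_PER_DB)
print(MAX_DOCS_PER_DB)  # 4294967295, i.e. the "4 billion" limit
print(shards)           # 4 databases would be needed at that ceiling
```

So even without a 64-bit docid, the collection could in principle be
spread over a handful of databases, at the cost of managing the split
yourself.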
The other limits mentioned at the end of scalability.html are less
likely to be an issue, but given a reason and some time I'm sure we
could fix the B-tree table maximum size and the storage of the total
length, too.
--
Richard