[Xapian-discuss] Indexing more than 15 billion documents
Richard Boulton
richard at lemurconsulting.com
Sun Jun 17 12:05:03 BST 2007
Ycrux wrote:
> Hi there!
>
> My company has more than 15 billion web documents.
> We're archiving all the web (like Internet Archive).
> Last spring, we used Lucene to provide full-text search capability,
> but we hit a limit at 500 million documents.
Out of interest, was that a hard limit (ie, the system doesn't allow
more documents than that) or a soft performance limit (ie, performance
became unacceptably slow at that size)?
> We're now considering XAPIAN. Could anybody share experiences
> in this kind of huge dataset?
You should probably take a look at
http://www.xapian.org/docs/scalability.html
I've not recently worked on any dataset larger than the 500 million
document database mentioned in that document.
In particular, there is currently a limit of 4 billion documents in a
database, due to the use of a 32-bit type for document IDs, but I don't
think it would be particularly hard to switch to a 64-bit type here (the
database format might require an incompatible change, but this could be
managed).  I, and I'm sure the other Xapian developers, would be happy
to work with you on fixing that.
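As a back-of-the-envelope sketch of what that 32-bit limit means in
practice (my arithmetic, not from the original post; it assumes docid 0
is reserved as the invalid ID, so a single database tops out at 2**32 - 1
documents, and that a larger collection would be split across several
databases searched together):

```python
# Rough capacity arithmetic for a 15-billion-document collection,
# assuming 32-bit document IDs with docid 0 reserved as "invalid".
import math

DOCID_BITS = 32
MAX_DOCS_PER_DB = 2**DOCID_BITS - 1   # largest valid docid in one database
TOTAL_DOCS = 15_000_000_000

shards = math.ceil(TOTAL_DOCS / MAX_DOCS_PER_DB)
print(MAX_DOCS_PER_DB)  # 4294967295, i.e. the "4 billion" limit
print(shards)           # 4 databases would be needed at that ceiling
```

So even without a 64-bit docid, the collection could in principle be
spread over a handful of databases, at the cost of managing the split
yourself.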
The other limits mentioned at the end of scalability.html are less
likely to be an issue, but given a reason and some time I'm sure we
could fix the B-tree table maximum size and the storage of the total
length, too.
--
Richard