[Xapian-discuss] Xapian and 10M (small) documents. What to expect?

Bart van Bragt xapian at vanbragt.com
Thu Sep 8 11:44:14 BST 2005


I've been thinking about integrating phpBB with Xapian for quite some 
time now and I guess I really should start to get things rolling. I 
haven't had a decent search on my site (www.bokt.nl) for ages now
and the users are getting pretty annoyed by that fact :)

I'm currently trying to figure out how I'm going to set this up. I 
probably also need to get some new hardware first to facilitate search. 
Does anyone have an idea what kind of hardware I would need to
search 10 million documents (approx. 4GB of text) with approx. 10,000 new 
postings per day?
Real-time indexing is nice, but I can also batch this up and do 
it during the night (servers are mostly idle during the night: 
http://status.bokt.nl/ ).
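
To put the numbers above in perspective, here is a rough back-of-envelope 
sketch in Python. It only uses the figures quoted in this mail; reading 
"4GB" as 4 GiB and the 6 hour nightly window are assumptions made for 
illustration, and none of this says anything about how Xapian itself 
performs.

```python
# Back-of-envelope sizing from the figures quoted in the mail.
TOTAL_DOCS = 10_000_000          # existing postings
TOTAL_TEXT_BYTES = 4 * 1024**3   # assumption: "4GB" read as 4 GiB
NEW_POSTS_PER_DAY = 10_000       # new postings per day
NIGHT_WINDOW_HOURS = 6           # assumption: length of the idle nightly window


def average_post_bytes(total_bytes, total_docs):
    """Average amount of text per posting, in bytes."""
    return total_bytes / total_docs


def daily_growth_fraction(new_per_day, total_docs):
    """Fraction by which the collection grows each day."""
    return new_per_day / total_docs


def batch_rate_per_second(new_per_day, window_hours):
    """Postings per second needed to index one day's batch in the window."""
    return new_per_day / (window_hours * 3600)


avg_bytes = average_post_bytes(TOTAL_TEXT_BYTES, TOTAL_DOCS)
growth = daily_growth_fraction(NEW_POSTS_PER_DAY, TOTAL_DOCS)
rate = batch_rate_per_second(NEW_POSTS_PER_DAY, NIGHT_WINDOW_HOURS)

print(f"average posting size: {avg_bytes:.0f} bytes")
print(f"daily growth: {growth:.2%}")
print(f"nightly batch rate: {rate:.2f} postings/second")
```

So the postings average a bit over 400 bytes each, the collection grows by 
about 0.1% per day, and a nightly batch needs well under one posting per 
second under these assumptions.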

Do I need a dedicated machine for searching? The site isn't exactly 
generating huge amounts of money so it would be very nice if we could 
use a (beefy) server to do both webserving and searching. Or we could
combine the database and the search, but I don't think those two combine 
very well; I'm guessing that the main bottleneck is going to be I/O?

Does anyone have experience with integrating Xapian with (PHP) forums? I 
know Arjan has plenty of experience with gathering.tweakers.net :D 
Speaking of which... I'd very much prefer to index individual postings 
instead of combining all posts in a topic to one document. The main 
reason for this is that combining large topics results in lots of hits 
on those large topics because they contain a LOT of search terms. This 
is my main gripe when searching on gathering.tweakers.net: you have to 
wade through lots of 300-page topics that do contain your search words, 
but in quite separate postings on separate pages. Most of the time 
those 300-page topics have no link at all with the subject that I'm
searching for. IMO searching in postings instead of topics should solve 
that problem. The main drawback is a (very significant?) performance 
loss I guess... Indexing topics would result in only 500k documents 
instead of 10M.
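
The difference between the two granularities can be shown with a toy 
sketch in Python. This is not Xapian code: the sample postings, the 
`match_all` helper and the "every query word must occur" matching rule are 
all made up for illustration, just to show why a large topic indexed as 
one document matches queries that none of its postings match.

```python
from collections import defaultdict

# Hypothetical sample data: (topic_id, post_id, text)
POSTS = [
    (1, 1, "my horse refuses the trailer"),
    (1, 2, "try a different saddle pad"),
    (1, 3, "anyone going to the show on sunday"),
    (2, 4, "which saddle fits a horse with high withers"),
]


def tokens(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return set(text.lower().split())


def match_all(docs, query):
    """Return ids of documents containing every word of the query."""
    wanted = tokens(query)
    return sorted(doc_id for doc_id, text in docs.items()
                  if wanted <= tokens(text))


# One document per posting.
per_post = {post_id: text for _, post_id, text in POSTS}

# One document per topic: all postings of a topic concatenated.
per_topic = defaultdict(str)
for topic_id, _, text in POSTS:
    per_topic[topic_id] += " " + text

query = "horse saddle"
post_hits = match_all(per_post, query)
topic_hits = match_all(dict(per_topic), query)

print("posting hits:", post_hits)  # [4]
print("topic hits:", topic_hits)   # [1, 2]
```

Topic 1 is returned at topic level even though "horse" and "saddle" occur 
in different, unrelated postings; at posting level only posting 4 matches. 
The price is four documents instead of two, which is the same trade-off as 
10M versus 500k.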

There seems to be a fairly large resemblance between gmane and phpBB 
indexing (both are about indexing topics/threads and lots of small 
postings). Is the gmane setup going to be public? Is it already
known what hardware this system is going to need?

Thanks in advance!

Bart van Bragt


