[Xapian-discuss] Xapian and 10M (small) documents. What to expect?

Arjen van der Meijden acmmailing at tweakers.net
Fri Sep 9 06:58:02 BST 2005


On 8-9-2005 12:44, Bart van Bragt wrote:
> I'm currently trying to figure out how I'm going to set this up, I 
> probably also first need to get some new hardware to facilitate search. 
> Does anyone have an idea what kind of hardware I would need to
> search 10 million documents (approx 4GB of text) with approx 10,000 new 
> postings per day?
> Real-time indexing is nice but I can also batch this up so we can do 
> this during the night (servers are mostly idle during the night: 
> http://status.bokt.nl/ ).

Real-time indexing will not allow you to use the faster-to-search 
compacted databases. Database compaction takes an hour or so with our 
database, which shrinks from about 15G "working" to 11G "compacted" in 
the Flint format.
So that is another reason not to index in real time.
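
A minimal sketch of what such a nightly batch step could look like, in 
Python. It assumes the compaction is done with the xapian-compact tool 
(source database first, destination last); the post does not say which 
tool or script is actually used. The paths, the function names and the 
symlink-swap step are hypothetical, added only to show how a searcher 
could keep reading the old compacted copy until the new one is ready.

```python
import os
import subprocess
from pathlib import Path


def compact_command(working_db, compacted_db):
    """Build the xapian-compact call: source database, then destination."""
    return ["xapian-compact", str(working_db), str(compacted_db)]


def publish(compacted_db, live_link):
    """Repoint the symlink the searcher opens (hypothetical layout).

    The new link is created under a temporary name and renamed over the
    old one, so readers see either the old or the new database.
    """
    live_link = Path(live_link)
    tmp = live_link.with_name(live_link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    tmp.symlink_to(Path(compacted_db).resolve())
    os.replace(tmp, live_link)  # rename() replaces the old link in one step


def nightly(working_db, compacted_db, live_link):
    """Run after the night's batch of postings has been indexed."""
    subprocess.run(compact_command(working_db, compacted_db), check=True)
    publish(compacted_db, live_link)
```

The destination should be a fresh directory name each night (for example 
one with the date in it), and live_link has to be a symlink rather than 
a real directory for the rename to work.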

> Do I need a dedicated machine for searching? The site isn't exactly 
> generating huge amounts of money so it would be very nice if we could 
> use a (beefy) server to do both webserving and searching. Or
> combine the database and the search but I don't think those two combine 
> really well, I'm guessing that the main bottleneck is going to be I/O?

I'd suggest a dedicated machine. We ran it on a webserver with 2G of 
memory a while back, but phrase searches in particular were very slow. 
With your per-posting set-up, the data to sift through per phrase 
search will probably be smaller, though. The more memory, the better; 
CPUs aren't very interesting, but your disk system is.

In our recent .plan you can read what our next search machine will be:
http://www.tweakers.net/plan/292

That machine is probably overspecified for now, but it will also serve 
another search database and is expected to cope with the growth in 
size and features for at least three years.

> Does anyone have experience with integrating Xapian with (PHP) forums? I 
> know Arjen has plenty of experience with gathering.tweakers.net :D 

We don't use the PHP bindings. In the beginning we'd convert the 
GET parameters to Omega-compatible ones and then just call the 
Omega application to do the hard work for us. Omega's output was 
formatted with a purpose-built query template, which made it easy to 
interpret in PHP.
Currently we have one machine with Omega running behind an 
xinetd superdaemon, and our webservers talk to it over TCP/IP, 
but it is basically the same as calling the local application.
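
To make that a bit more concrete, here is a rough sketch in Python 
(not our PHP code). DB, FMT, P, HITSPERPAGE and TOPDOC are Omega's CGI 
parameter names; the site-side names "q" and "page", the database and 
template names, and above all the wire format (one query-string line 
in, the template output back until the connection closes) are 
assumptions for illustration only.

```python
import socket
from urllib.parse import urlencode


def to_omega_query(site_params, db="forum", template="phpresults",
                   per_page=25):
    """Translate the site's GET parameters into an Omega query string.

    "q" and "page" are hypothetical site-side names; db and template
    are placeholder values.
    """
    page = max(int(site_params.get("page", 1)), 1)
    omega = {
        "DB": db,                          # database to search
        "FMT": template,                   # query template for the output
        "P": site_params.get("q", ""),     # the user's query text
        "HITSPERPAGE": per_page,
        "TOPDOC": (page - 1) * per_page,   # offset of the first hit
    }
    return urlencode(omega)


def ask_search_host(query_string, host, port, timeout=5.0):
    """Send one query line to the search machine and read the reply.

    Assumed protocol: the xinetd-started wrapper reads a single line
    and writes the formatted result until it closes the connection.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(query_string.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With a template that prints one easily split line per hit, the web 
front end only has to parse the reply and render it.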

In my experience that easily beats the old "remote database" in terms of 
performance, since that used to send all result data over the wire and 
expected the client to sort the results.
Whether it still beats the current remote set-up I don't know, but we're 
not going to change a working set-up just to find out ;)

> Talking about which... I'd very much prefer to index individual postings 
> instead of combining all posts in a topic to one document. The main 
> reason for this is that combining large topics results in lots of hits 
> on those large topics because they contain a LOT of search terms. This 
> is my main grief when searching on gathering.tweakers.net, you have to 
> wade through lots of 300 page topics that do contain your searchwords 
> but in quite separate postings on separate pages. Most of the time 
> those 300 page topics have no link at all with the subject that you're 
> searching for. IMO searching in postings instead of topics should solve 
> that problem. The main drawback is a (very significant?) performance 
> loss I guess... Indexing topics would result in only 500k documents 
> instead of 10M.

In my personal experience those large topics indeed aren't that useful 
as search results, but the within-document-frequency weighting will 
likely push them down the result list if they are really useless. 
Then again, you can search within that topic when you want to be sure it 
really does(n't) contain your terms.
My main concern with per-posting searching is that you'll end up with 
lots of small fragments of a document, which may result in not being 
able to find a certain topic because the terms you specified were 
scattered over the separate postings.
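
A toy illustration of that last point, in plain Python without Xapian. 
The sample postings and function names are made up, and it assumes 
every query term is required (AND-style matching) with naive 
whitespace tokenising, which is far simpler than what a real index 
does.

```python
# Three made-up postings in two topics.
postings = [
    {"topic": 1, "text": "my raid controller died last night"},
    {"topic": 1, "text": "did you check the firmware version"},
    {"topic": 2, "text": "raid firmware update went fine"},
]


def terms(text):
    """Naive tokeniser: lowercase and split on whitespace."""
    return set(text.lower().split())


def match_postings(posts, query):
    """Each posting is its own document; all terms must be in one posting."""
    wanted = terms(query)
    return sorted({p["topic"] for p in posts if wanted <= terms(p["text"])})


def match_topics(posts, query):
    """All postings of a topic are merged into one document first."""
    wanted = terms(query)
    merged = {}
    for p in posts:
        merged.setdefault(p["topic"], set()).update(terms(p["text"]))
    return sorted(t for t, found in merged.items() if wanted <= found)
```

Here match_topics(postings, "controller firmware") finds topic 1, while 
match_postings returns nothing for the same query because the two words 
sit in different postings. The flip side is your complaint: for 
"raid firmware" the per-topic variant also returns topic 1, where the 
words are only loosely related.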

Good luck.

Arjen


