[Xapian-discuss] using Xapian as backend for google

Felix Antonius Wilhelm Ostmann ostmann at websuche.de
Mon Dec 11 08:56:10 GMT 2006


Olly Betts wrote:
> On Thu, Dec 07, 2006 at 10:02:03AM +0100, Felix Antonius Wilhelm Ostmann wrote:
>   
>> Now I must figure out how we can use Xapian in the best way. The idea is
>> to build many flint indexes so we can generate them quickly on many
>> machines and then merge them. The frontend will be a web server with
>> Apache and mod_perl. Is it best to run xapian-tcpsrv on other machines
>> as the backend? I think so ... or is another web server with mod_perl
>> and the Perl bindings the ideal solution? My question: can someone tell
>> me something about building the backend for the next Google? :) What is
>> important?
>>     
>
>   
>> RAID 0 VS RAID 1
>>     
>
> RAID 1 should be faster for reading, and actually has redundancy so it
> can survive a disk dying, but you get half as much storage volume from
> the same disks.  In other words, it'll cost about twice as much.
>
> Incidentally, there are many more RAID configurations than just these
> two.  Wikipedia has an overview:
>
> http://en.wikipedia.org/wiki/RAID
>   
After one weekend I think RAID is the wrong way ... splitting the index
across different drives would be faster and we don't lose the space :)
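
To make the "one index per drive" idea concrete, here is a minimal sketch of how documents could be assigned to drives at indexing time. It is only an illustration under assumptions: the mount points in SHARD_PATHS and the helper names shard_for and path_for are hypothetical, and hashing the URL is just one possible way to spread documents evenly.

```python
import hashlib

# Hypothetical mount points: one index directory per physical drive.
SHARD_PATHS = ["/disk0/index", "/disk1/index", "/disk2/index", "/disk3/index"]


def shard_for(url: str, num_shards: int) -> int:
    """Map a document URL to a shard number in [0, num_shards).

    A cryptographic hash is used only because it is stable across runs
    and machines (Python's built-in hash() is randomized per process).
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def path_for(url: str) -> str:
    """Return the index directory that should hold the given document."""
    return SHARD_PATHS[shard_for(url, len(SHARD_PATHS))]
```

The same URL always lands on the same drive, so re-indexing a page replaces the old copy instead of duplicating it on another drive. Unlike RAID 1 there is no redundancy here: a dead drive means that part of the index has to be rebuilt.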


>   
>> SCSI VS SATA
>>     
>
> It depends on budget and how big you want to grow.  SATA is cheaper and
> probably similar in speed to where SCSI was a few years ago, but iSCSI
> and Fibre Channel are likely to end up faster in most cases.
>
>   
>> many smaller backends VS some big backends?
>>     
>
> There are definitely downsides to having too many backend servers.  But
> if you have a lot of data, splitting a search over several machines can
> be a win.  You'll need to profile if you want to find the sweet spot for
> your setup, but I'd think it's likely to be nearer a few than a few
> hundred.
>
> Note that there's some overhead to using the remote backend, and also
> some to using multiple databases.  Another possible architecture is
> to just have several servers searching replicated copies of a single
> large database.
>
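
As a rough picture of what searching several backends involves, the sketch below merges per-backend result lists into one top-k list. It is not how Xapian implements it; the function name merge_top_k and the (score, doc_id) tuples are assumptions for illustration, and it assumes the scores from different backends are directly comparable.

```python
import heapq
from itertools import islice


def merge_top_k(shard_results, k):
    """Merge per-shard hit lists into a single list of the k best hits.

    Each element of shard_results is a list of (score, doc_id) tuples
    that is already sorted by score, best first.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.7, "b1"), (0.6, "b2")]
top = merge_top_k([shard_a, shard_b], 3)
# top == [(0.9, "a1"), (0.7, "b1"), (0.6, "b2")]
```

Every backend has to return its own top k before the merge can produce a correct global top k, which is one source of the overhead mentioned above.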
>   
>> What would be the bottleneck (I think disk I/O)?
>>     
>
> It's likely to be.  Note that there's scope for improving matters with
> enhancements to Xapian here - there are some obvious things to improve
> (which I'm working my way through), and profiling should reveal more.
> For a large operation, it's worth investing some time in such fine
> tuning as it can seriously reduce the amount of hardware you need to buy
> and house!
>
>   
>> Is xapian-tcpsrv the best way? Can anyone tell me something about
>> such a project?
>>     
>
> Webtop used xapian-tcpsrv to spread searches over a number of boxes
> (10 or so IIRC).  The index size was around 500 million documents, but
> with modern hardware that's much less of a challenge than it was more
> than 6 years ago.
>
> Also the remote backend has been completely rewritten since then, and
> the local backend Webtop used was the legacy "muscat36 da" one, which
> flint should outperform by some margin.
>
>   
>> One other question: "similar results from one domain".
>> How can we achieve that goal? Should a MatchDecider check the values
>> holding the domain name and accept only two documents from one domain?
>> Is that the way?
>>     
>
> If you just want two documents from any one domain, it wouldn't be hard
> to extend the collapse feature to leave N documents behind instead of
> just one.
>
> Only collapsing "similar" results is harder - first you need to decide
> how to define "similar" I guess.
>   
Hmmm ... the problem is that one domain can include 100,000 or more
documents. When a search matches 20,000 documents from this domain, the
MatchDecider must access 20,000 values (with the domain name) and reject
19,998 documents. And perhaps the next domain has another 100,000
documents with 15,000 matches. I don't know :( Is the MatchDecider the
right way? Or should I run more than one search against Xapian to get the
right results? I don't see any solution for this problem yet :(
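
One alternative I can imagine is to filter on the client side after the search, instead of inside a MatchDecider. The sketch below is only that idea in plain Python, not Xapian's collapse feature: the function name limit_per_domain and its parameters are hypothetical, and it assumes the results arrive as URLs in rank order.

```python
from collections import Counter
from urllib.parse import urlsplit


def limit_per_domain(ranked_urls, max_per_domain=2, page_size=10):
    """Keep at most max_per_domain results per host, preserving rank order.

    ranked_urls may be any iterable (including a lazy one); iteration
    stops as soon as the page is full, so only as many candidates are
    examined as are needed to fill it.
    """
    seen = Counter()
    page = []
    for url in ranked_urls:
        domain = urlsplit(url).hostname or ""
        if seen[domain] >= max_per_domain:
            continue  # this domain already has its quota on the page
        seen[domain] += 1
        page.append(url)
        if len(page) == page_size:
            break
    return page


ranked = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/3",
    "http://b.example/1",
]
page = limit_per_domain(ranked, max_per_domain=2, page_size=10)
# page == ["http://a.example/1", "http://a.example/2", "http://b.example/1"]
```

The weak point is the same as in my question: if one domain dominates the top of the ranking, many candidates must still be fetched and skipped before the page is full, so the search has to request more than one page of results.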


> Cheers,
>     Olly
>
>
>   
thanks,
       Felix :)



-- 
Kind regards

Felix Antonius Wilhelm Ostmann
--------------------------------------------------
Websuche   Search   Technology   GmbH   &   Co. KG
Martinistraße 3  -  D-49080  Osnabrück  -  Germany
Tel.:   +49 541 40666-0 - Fax:    +49 541 40666-22
Email: info at websuche.de - Website: www.websuche.de
--------------------------------------------------
AG Osnabrück - HRA 200252 - Ust-Ident: DE814737310
Komplementärin:     Websuche   Search   Technology
Verwaltungs GmbH   -  AG Osnabrück  -   HRB 200359
Geschäftsführer:  Diplom Kaufmann Martin Steinkamp
--------------------------------------------------



