[Xapian-discuss] time to build index

Kevin Duraj kevin.softdev at gmail.com
Tue Nov 4 22:47:33 GMT 2008


This is an interesting problem that I have been dealing with for a
couple of years, and I'm glad it has come up on the mailing list
without me having to describe it. I use a single index holding
anywhere between 50 and 75 million documents, and I set
XAPIAN_FLUSH_THRESHOLD=1000000.
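
For anyone driving an indexer from a script, here is a minimal sketch of
passing that setting through the environment. It assumes the indexer runs
as a separate process; the helper name indexer_env is made up for
illustration and is not part of Xapian. The value is written as plain
digits, as in Olly's suggestion quoted below.

```python
import os
from typing import Dict


def indexer_env(flush_threshold: int) -> Dict[str, str]:
    """Return a copy of the current environment with
    XAPIAN_FLUSH_THRESHOLD set to the given number of documents."""
    env = dict(os.environ)  # copy, so the parent environment is untouched
    # Environment values are strings; str() yields plain digits, no separators.
    env["XAPIAN_FLUSH_THRESHOLD"] = str(flush_threshold)
    return env
```

The result could be handed to something like
subprocess.run(["python", "my_indexer.py"], env=indexer_env(1000000)),
where my_indexer.py stands in for whatever indexing script you use.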

The first million documents take around 20 minutes to index, but the
time per million rises to about 1.5 hours as more documents are added
and the index approaches 50 million documents. Indexing 100 million
documents into one index on a single hard drive is a real struggle. So
I now create 100 indexes of approximately 1 million documents each and
then merge them. This works like a charm on a RAID of hard disks; the
merge takes time, but less than building a single index directly.
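
The shard-and-merge approach can be sketched roughly as follows. This is
only an outline under assumptions: the shard directory names and both
helper functions are hypothetical, and the merge step is assumed to use
the xapian-compact tool, which takes one or more source databases
followed by a destination. The post does not say which tool was actually
used for merging.

```python
from typing import List, Sequence


def shard_paths(total_docs: int, docs_per_shard: int,
                prefix: str = "shard") -> List[str]:
    """Name one index directory per batch of documents (hypothetical naming)."""
    count = -(-total_docs // docs_per_shard)  # ceiling division
    return [f"{prefix}-{i:03d}" for i in range(count)]


def compact_command(sources: Sequence[str], destination: str) -> List[str]:
    """Build an argument list of the form:
    xapian-compact SOURCE... DESTINATION"""
    return ["xapian-compact", *sources, destination]


# 100 million documents at 1 million per shard gives 100 small indexes.
shards = shard_paths(100_000_000, 1_000_000)
merge = compact_command(shards, "merged")
```

Once every shard has been built, the merge list could be run with
subprocess.run(merge, check=True). Each shard stays small enough to index
at the fast early rate, which is the point of splitting the work.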

Kevin Duraj
http://myhealthcare.com/search?q=neurobiology


On Thu, Oct 16, 2008 at 7:24 AM, Jeroen van Dijk
<jeroentjevandijk at gmail.com> wrote:
> Thanks for your reply, Olly. The 'XAPIAN_FLUSH_THRESHOLD' setting you
> pointed out was indeed one of the reasons it took so long. The other
> reasons were a bad network connection and the wrong MySQL gem (I'm working
> with Ruby).
>
> The indexing process took 3 hours and created an index database of around
> 350 MB.
>
> Now I'll see if I can get it running with my Rails app :)
>
> Jeroen
>
> On Wed, Oct 15, 2008 at 3:58 PM, Olly Betts <olly at survex.com> wrote:
>
>> On Wed, Oct 15, 2008 at 02:16:15PM +0200, Jeroen van Dijk wrote:
>> > The indexing process got to 1.2 million records and then lost the
>> > connection (my own fault, I guess) after 16 hours, having built up an
>> > index database of around 300 MB.
>> >
>> > Should I be suspicious or should I just wait a little longer?
>>
>> That seems rather slow.  It depends on the data and the hardware, but
>> I'd expect more like a million documents per hour.
>>
>> If you aren't already, try setting XAPIAN_FLUSH_THRESHOLD in the
>> environment to a value higher than the default of 10000.  The best value
>> depends on the nature of the data and how much memory you have, but
>> 1000000 is worth a try.
>>
>> I've just realised that we don't actually seem to document
>> XAPIAN_FLUSH_THRESHOLD anywhere, which probably explains why I have to
>> keep highlighting it on the mailing list!  I'll write up something...
>>
>> Cheers,
>>     Olly
>>



-- 
Kevin Duraj
http://pacificair.com


