[Xapian-discuss] PHP Fatal error while indexing Wikipedia

James Aylett james-xapian at tartarus.org
Wed Jan 2 20:15:40 GMT 2008


On Wed, Jan 02, 2008 at 07:13:09PM +0000, Robert Young wrote:

> - The position and postlist files seem to be growing at a tremendous
> rate. The indexer hasn't even got past the first 2.0Gb chunk and
> already both the position.DB and postlist.DB are each over 1.2Gb. I
> have tried to find out exactly what each of the files does but haven't
> had much luck. A brief addition to each of the table pages on the wiki
> on what the table actually does would be really helpful and
> gratefully received.

The post list is the index that goes from a term to the list of
documents that term indexes. The position list is, for each term in a
given document, the list of positions at which that term appears.

Disabling positions in your index will remove the need for the
position list. You can't avoid the post list, as it's the main thing
you need :-)
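
For instance, with the PHP5 bindings that might look something like
this (a rough sketch only - $dbpath and $text are placeholders, not
anything from your setup):

    <?php
    // Sketch: index a document without positional information, so
    // nothing gets written to the position table. Assumes the Xapian
    // PHP bindings are loaded.
    $db = new XapianWritableDatabase($dbpath, Xapian::DB_CREATE_OR_OPEN);

    $indexer = new XapianTermGenerator();
    $indexer->set_stemmer(new XapianStem("english"));

    $doc = new XapianDocument();
    $doc->set_data($text);
    $indexer->set_document($doc);

    // Same term generation as index_text(), but no positions recorded.
    $indexer->index_text_without_positions($text);

    $db->add_document($doc);
    $db->flush();
    ?>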

If you have a large number of unique terms being generated, you'll
get a large database. Something unexpected may be going on in your
term generation - you could dump a list of terms with a little PHP
script to find out what's happening. (Maybe run in just a couple of
documents and see if you get the expected list of terms.)
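
Something along these lines, perhaps (sketch; the database path is a
placeholder):

    <?php
    // Sketch: dump every term in the database, one per line. Run it
    // against a test index built from just a couple of documents.
    $db = new XapianDatabase("/path/to/testindex");
    $it = $db->allterms_begin();
    $end = $db->allterms_end();
    while (!$it->equals($end)) {
        print $it->get_term() . "\n";
        $it->next();
    }
    ?>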

> - As the index gets bigger the disk gets hammered. Now, obviously
> this is to be expected to an extent but things are getting really
> bad, looking at 90-95% cpu waiting on IO. I'm guessing this is in
> part due to the fact that I'm doing this on my laptop with its
> crappy laptop disk

Partly, of course, because you're probably using a single fairly slow
spindle for both reading the Wikipedia data and writing the database
(with its four or so tables). You probably don't have enough main
memory for write-behind to handle the database tables efficiently -
explicitly flushing more often (or lowering the default flush
threshold) may help here.
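
For example (a sketch - the 1000-document interval is just
illustrative; the default automatic flush threshold is 10000
documents, and can be lowered via the XAPIAN_FLUSH_THRESHOLD
environment variable):

    <?php
    // Sketch: flush explicitly every 1000 documents instead of relying
    // on the default threshold. $dbpath and $docs are placeholders.
    // Alternatively, lower the automatic threshold - it's read when
    // the database is opened, so set it beforehand.
    putenv("XAPIAN_FLUSH_THRESHOLD=1000");

    $db = new XapianWritableDatabase($dbpath, Xapian::DB_CREATE_OR_OPEN);
    $count = 0;
    foreach ($docs as $doc) {
        $db->add_document($doc);
        if (++$count % 1000 == 0) $db->flush();
    }
    $db->flush();
    ?>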

I assume you aren't actually swapping out the indexing process? If
you wait for too many documents before flushing, there's a danger that
the indexing process's code will be fighting with its data (and the
kernel buffers) for memory. If you're in that situation, again
lowering the default flush threshold may work, but other than that or
buying more memory you may simply be stuck.

> and partly due to using replace_document so that it has to do a
> query on each update. Is there any way of making queries optimized for
> querying uids? Would having an auxiliary index just for uid to docid
> lookups help so that I only need call replace_document on documents I
> know are in the index?

I don't actually know how replace_document works precisely when given
a unique identifying term (which is what I assume you mean by
UID). What it'll do under those circumstances is to check the posting
list for that term; it should be pretty fast at *finding* the entry in
the posting list (because that's kind of the point of Xapian's backend
:-), but will slow down dramatically if you can't get all the relevant
btree blocks into memory. Specifically, if you can't keep all the
'trunk' blocks that govern the 'U'-prefixed area (assuming you're
using 'U' as your unique term prefix) in memory, this is going to be
horrendously slow.
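
In PHP that pattern looks roughly like this (sketch only - the 'U'
prefix, $wikiId and $text are assumptions, not your actual code):

    <?php
    // Sketch: update-or-add a document keyed by a unique id term.
    // Assumes the Xapian PHP bindings; all variables are placeholders.
    $db = new XapianWritableDatabase($dbpath, Xapian::DB_CREATE_OR_OPEN);
    $indexer = new XapianTermGenerator();

    $idterm = "U" . $wikiId;  // 'U'-prefixed unique id term

    $doc = new XapianDocument();
    $doc->set_data($text);
    $doc->add_term($idterm);
    $indexer->set_document($doc);
    $indexer->index_text_without_positions($text);

    // Replaces whatever document is currently indexed by $idterm, or
    // adds a new one if none exists. The lookup walks the posting list
    // for that term, which is why its btree blocks need to stay hot.
    $db->replace_document($idterm, $doc);
    ?>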

Note that even if you somehow avoid the replace_document() call in a
memory-efficient way, you still aren't going to index fast if you
can't keep the trunk blocks of the posting list in memory, because
you're going to need a lot of them in order to write a new document to
disk. (On write, some of them may then become unused, but that's fine
- again, providing you have enough memory.)

> - Indexing performance really really drops off as the index grows.
> It's not great at any rate as it's running on my laptop but it's been
> running for over 12 hours now and it's still not indexed the first 2Gb
> chunk. I'm guessing this is related to the second point.

It may be. If you're sitting in iowait 90-95% of the time, you're
basically not doing anything. iostat(1) (invoked as, say, `iostat -x 2`)
on most Unixoids will verify that it's your disk getting thrashed (and
will give you an idea of svctm and await or similar - the average time
to service an IO request and the average time from entering the wait
queue to completion of IO service - which probably won't help in this
case but is often useful to know).

Something like slabtop(1) will let you look at the usage of various
memory caches and buffers, if you're on Linux (sorry, I can't remember
if you've said this). If the OS has run out of core to buffer
efficiently, it may help you track down where the memory has gone and
come up with a way round it.

At the end of the day, though, indexing large quantities of data on a
laptop is ambitious :-)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


