[Xapian-discuss] Incremental indexing limitations
Ron Kass
ron at pidgintech.com
Thu Oct 11 17:51:50 BST 2007
Assume we have a server with 2GB of memory and 500GB of disk.
We want to use it to index a constantly updating database of documents,
and let's say each document is 2KB of text.
We have a process that continuously indexes data into a Xapian database.
This process flushes updates every 10K documents (to make sure they are
searchable and successfully stored) and, after each flush, marks those
documents as indexed.
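For concreteness, the batching logic looks roughly like the sketch below. FakeDb and mark_indexed are placeholders standing in for the real Xapian writable database and our own bookkeeping, not actual Xapian API:

```python
# Sketch of the batched-indexing loop described above. FakeDb and
# mark_indexed are stand-ins, not Xapian API.

class FakeDb:
    """Stand-in for a Xapian writable database: counts adds and flushes."""
    def __init__(self):
        self.added = 0
        self.flushes = 0
    def add_document(self, doc):
        self.added += 1
    def flush(self):
        self.flushes += 1

def mark_indexed(batch):
    """Placeholder for our own 'mark these documents as indexed' step."""
    pass

def index_all(db, documents, flush_every=10_000):
    """Index documents, flushing every flush_every docs; returns flush count."""
    pending = []
    for doc in documents:
        db.add_document(doc)
        pending.append(doc)
        if len(pending) >= flush_every:
            db.flush()             # batch is now searchable and durable
            mark_indexed(pending)  # mark only after a successful flush
            pending = []
    if pending:                    # trailing partial batch
        db.flush()
        mark_indexed(pending)
    return db.flushes
```

With flush_every=10,000, indexing 25,000 documents produces three flushes: two full batches plus one trailing partial batch.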
A few observations so far:
1. The size of such a database works out to about 3KB per document,
which is bigger than the text of the documents themselves. This is quite
surprising, since Xapian is supposedly storing just the term IDs and the
positions. Any ideas why it's bigger than the original text rather than,
say, a third of its size?
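To put numbers on that, using the 2KB and 3KB figures above:

```python
# Back-of-envelope check of observation 1: observed index size per
# document versus the size of the original text.

text_kb = 2.0    # average document text, KB
index_kb = 3.0   # observed on-disk index size per document, KB

ratio = index_kb / text_kb
print(f"index is {ratio:.1f}x the original text")  # 1.5x

# The hoped-for ~1/3 ratio would instead mean:
hoped_kb = text_kb / 3
print(f"hoped-for size: ~{hoped_kb:.2f} KB/doc")   # ~0.67 KB/doc
```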
2. Such an indexing process starts out very fast, but once the database
reaches 2M or 3M documents, each flush takes 2-4 minutes. That is
already very slow for a flush of 10K small documents. Flushing every 1K
documents instead doesn't help. The time a flush takes seems to depend
not so much on the size of the flush itself as on the size of the
database. Why is that? What happens during a flush?
3. If flushing 10K documents takes 2.5 minutes when the database holds
3M documents, does that mean each flush will take over an hour once it
holds 100M? If so, that is extremely "painful", isn't it?
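The arithmetic behind that worry, assuming flush time grows roughly linearly with database size (which is what the measurements suggest):

```python
# Extrapolating observation 3: scale the measured 2.5-minute flush
# at 3M documents up to a 100M-document database, assuming linear
# growth of flush time with database size.

minutes_measured = 2.5
docs_measured = 3_000_000
docs_target = 100_000_000

minutes_at_target = minutes_measured * docs_target / docs_measured
print(f"~{minutes_at_target:.0f} minutes per flush")  # ~83 minutes
```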
So, thinking about how to optimize this process, the obvious idea is to
keep a small "live" database for updates and periodically merge it into
a big one.
However, there are three issues here:
1. Such a merge is slow. It will take many hours to compact/merge each
such live database into the main one. If this has to be done hourly,
say, and the process takes more than an hour, we are faced with a
critical problem.
2. The merge process takes quite a lot of system resources, starving
the more urgent tasks (indexing and searching) of CPU, I/O and memory.
3. It also means we can never use more than 50% of the disk space on
the server; in fact, less than 40-45% to be safe. This is because the
compact process merges the big database and the small one into a new
database, which will be bigger than the original big one. So, just
because of this process, the server's disk space cannot be used
effectively.
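A rough worked example of the disk-space constraint, using the 500GB disk above and an illustrative 25GB live database:

```python
# Worst-case disk use during a compact/merge: the big database (B),
# the live database (S), and the newly written merged database
# (roughly B + S) all exist at once, so peak usage is about
# 2 * (B + S). The 25GB live-database size is illustrative.

disk_gb = 500
small_gb = 25                # live database (assumed size)
big_gb = 0.45 * disk_gb      # main database at 45% of disk = 225 GB

peak_gb = big_gb + small_gb + (big_gb + small_gb)  # old + live + merged
print(f"peak usage: {peak_gb:.0f} GB of {disk_gb} GB")  # 500 of 500 GB
```

So even a 45% main database briefly fills the whole disk during the merge, which is why the usable fraction has to stay well under 50%.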
Any thoughts and insights about the matter are greatly appreciated.
Ron