[Xapian-discuss] Sanity check on database size

Jean-Francois Dockes jean-francois.dockes at wanadoo.fr
Thu Apr 6 10:32:24 BST 2006


First, I apologize for the time you spent answering my previous message. I
was looking for something stupid that I might be doing, and indeed, I was not
disappointed.

I had 'just' overlooked the fact that the document set was made of
compressed mailboxes, which get squeezed very well. The actual size of the
uncompressed document data is 1.15 GB. So the compacted xapian db is just a
little bigger than the uncompressed document set, nothing to be alarmed
about.

However, I am posting the database sizes below, in case they are of
interest.

By the way, in the course of my 'investigation', I looked for a document
with at least a rough description of the contents and organisation of the
database tables, and how they are used during a query, but, if it does
exist, there doesn't seem to be an obvious pointer to it. Such a document
would be extremely useful to understand what one is doing while using the
API.

For example, I had assumed that the length of the unique file path terms
that I use to identify documents did not matter much, because prefix
compression would be extremely efficient on them. Actually, after having a
look at the termlist_DB file, this does not appear to be the case (they are
repeated separately for each document, or am I mistaken again?), and, for
very small documents and long paths, this may become significant for the
termlist_DB size.
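
To illustrate the point about prefix compression, here is a rough Python
sketch of generic front coding. It is not Xapian's actual on-disk format:
the "Q" prefix, the paths, the filler words and the one-byte length cost are
all assumptions made up for the example. It only shows that a shared prefix
helps when similar terms sit next to each other in one sorted list, and
helps much less when each path sits alone in its own document's term list.

```python
import os

def front_code(sorted_terms):
    """Front-code a sorted list as (shared prefix length, remaining suffix)."""
    out, prev = [], ""
    for term in sorted_terms:
        shared = len(os.path.commonprefix([prev, term]))
        out.append((shared, term[shared:]))
        prev = term
    return out

def coded_size(coded):
    # Simplified cost model: one byte for the shared length plus the suffix.
    return sum(1 + len(suffix) for _, suffix in coded)

def path_cost_in_doc(path, other_terms):
    """Coded cost of the path term inside a single document's term list."""
    terms = sorted(other_terms + [path])
    shared, suffix = front_code(terms)[terms.index(path)]
    return 1 + len(suffix)

# Hypothetical unique terms: a "Q" prefix followed by a file path.
paths = sorted("Q/home/user/mail/archive/2006/msg%04d" % i for i in range(100))

raw = sum(len(p) for p in paths)

# Case 1: all path terms are neighbours in one sorted list.
adjacent = coded_size(front_code(paths))

# Case 2: each path is stored once per document, among unrelated words.
per_doc = sum(path_cost_in_doc(p, ["hello", "world"]) for p in paths)

print("raw:", raw, "adjacent:", adjacent, "per document:", per_doc)
```

In the second case every path is stored almost in full, which matches the
observation that long paths weigh on the term lists of small documents.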


Some answers to the questions in your reply:

(database size ~ document set size)
Olly Betts writes:
 > Roughly - generally I'd expect the database to be somewhat smaller than
 > the document set if you're indexing positional information.

Being a personal tool, recoll assumes that space does not really matter
(disks are cheap and mostly full of multimedia, which doesn't get into
xapian). It has no stoplist and basically indexes any term it can extract,
however crappy it may look. It also does no stemming, as you mentioned. This
probably explains why a typical recoll db has a size close to the doc
set's.

 > Also, do you put a limit on term size?  Omega's indexers ignore
 > probabilistic terms longer than 64 characters, since they're usually
 > junk like uuencoded or base64 data.

Yes, the term size limit is 40 characters. This is probably a bit low,
but I just can't imagine a user typing a search term longer than 40
characters :)
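
As an illustration only, a term length filter of this kind could look like
the Python sketch below. The function names are hypothetical (this is not
recoll or Omega code), and measuring the length in UTF-8 bytes rather than
characters is an assumption.

```python
MAX_TERM_LEN = 40  # limit mentioned above; the quoted text gives 64 for Omega

def keep_term(term, max_len=MAX_TERM_LEN):
    """Return True if the term is non-empty and short enough to index."""
    # Assumption: the limit applies to the UTF-8 encoded length.
    return 0 < len(term.encode("utf-8")) <= max_len

def filter_terms(terms, max_len=MAX_TERM_LEN):
    """Drop over-long terms, which are usually encoded junk."""
    return [t for t in terms if keep_term(t, max_len)]

# The last entry stands in for a 60-character run of base64 data.
words = ["xapian", "database", "QUJD" * 15]
kept = filter_terms(words)
print(kept)
```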

 > [...]
 > 
 > Given the setting of XAPIAN_FLUSH_THRESHOLD, the memory used depends
 > mostly on the size of the documents being handled (we buffer the posting
 > lists as we generate them - essentially we build the inverted file in
 > XAPIAN_FLUSH_THRESHOLD document chunks).

After searching for XAPIAN_FLUSH_THRESHOLD, I saw that this question was
repeatedly answered on the mailing list. I hadn't used the right search
terms before :)

Actually, from a user's point of view, I think that the relevant parameter
to set is the amount of memory used, not a flush threshold expressed as a
number of documents. Wouldn't it be possible for xapian to maintain a very
rough estimate of the memory used during indexing, and flush when it exceeds
a set threshold, independently of the number of documents indexed? The
threshold might be exceeded because of big documents, etc., but this would
come closer to the relevant operational parameter.
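
To make the suggestion concrete, here is a minimal Python sketch of a
buffer that flushes on an estimated memory budget instead of a document
count. It is not Xapian code: the class name, the callback and the per-term
and per-posting byte costs are all invented for the example, and the
estimate is deliberately crude.

```python
class BufferedIndexer:
    """Toy in-memory inverted file that flushes on an estimated byte budget."""

    def __init__(self, max_bytes, flush_cb):
        self.max_bytes = max_bytes
        self.flush_cb = flush_cb   # called with the buffered postings
        self.postings = {}         # term -> list of docids
        self.estimated_bytes = 0

    def add_document(self, docid, terms):
        for term in terms:
            plist = self.postings.get(term)
            if plist is None:
                plist = self.postings[term] = []
                self.estimated_bytes += len(term) + 64  # guessed per-term overhead
            plist.append(docid)
            self.estimated_bytes += 8                   # guessed per-posting cost
        # The check runs once per document, so a big document can overshoot.
        if self.estimated_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.postings:
            self.flush_cb(self.postings)
        self.postings = {}
        self.estimated_bytes = 0

flushes = []
indexer = BufferedIndexer(max_bytes=1000, flush_cb=lambda p: flushes.append(len(p)))
for docid in range(1, 51):
    indexer.add_document(docid, ["term%d" % (docid % 7), "common"])
indexer.flush()  # write out whatever is still buffered
print("number of flushes:", len(flushes))
```

Checking the budget only at document boundaries keeps each document's
postings in a single chunk, at the cost of sometimes exceeding the limit.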

The stats follow.

Regards, and apologies again,
J.F. Dockes



The document set is 232,580 KB compressed and 1,161,119 KB uncompressed.

ndocs 244852 lastdocid 244852 avglength 539.113

Total number of terms: 1,141,729
Size of term dump: 26,182,561 bytes (Avg term size 22)
Max term length 40 bytes, except for unique terms identifying documents
(paths) which are longer.

corbieres$ ls -s xapiandb/
total 2477756
      4 meta             524332 postlist_DB         8 termlist_baseA
     20 position_baseA        4 record_baseA        8 termlist_baseB
     20 position_baseB        4 record_baseB   506156 termlist_DB
1257624 position_DB      189536 record_DB           4 value_baseA
     12 postlist_baseA        4 stem_english        4 value_baseB
     12 postlist_baseB        4 stem_french         0 value_DB

corbieres$ quartzcompact xapiandb/ compacted
postlist: Reduced by 55.4271% 290336K (523816K -> 233480K)
record: Reduced by 35.1783% 66608K (189344K -> 122736K)
termlist: Reduced by 40.6442% 205520K (505656K -> 300136K)
position: Reduced by 20.7496% 260696K (1256392K -> 995696K)
value: Size unchanged (0K)
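
The reductions reported by quartzcompact can be re-derived from the before
and after sizes it prints. The short Python check below uses only those
figures; the overall figure it prints is a derived number, not something
quartzcompact reported.

```python
# (before, after) sizes in KB, copied from the quartzcompact output above.
tables = {
    "postlist": (523816, 233480),
    "record": (189344, 122736),
    "termlist": (505656, 300136),
    "position": (1256392, 995696),
}

def reduction(before, after):
    """Return (KB saved, percentage saved) for one table."""
    saved = before - after
    return saved, 100.0 * saved / before

for name, (before, after) in tables.items():
    saved, pct = reduction(before, after)
    print("%-9s saved %7dK (%.4f%%)" % (name, saved, pct))

total_before = sum(b for b, _ in tables.values())
total_after = sum(a for _, a in tables.values())
print("overall: %.1f%%" % (100.0 * (total_before - total_after) / total_before))
```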

corbieres$ ls -s compacted/
total 1653744
     4 meta            233712 postlist_DB     300436 termlist_DB
     4 position_baseA       4 record_baseA         4 value_baseA
    16 position_baseB       4 record_baseB         4 value_baseB
996676 position_DB     122860 record_DB            0 value_DB
     4 postlist_baseA       4 termlist_baseA
     4 postlist_baseB       8 termlist_baseB

corbieres$ XAPIAN_PREFER_FLINT=yes copydatabase xapiandb flint
corbieres$ ls -s flint/
total 1967908
     0 flicklock           12 postlist.baseB       8 termlist.baseB
     4 iamflint        524332 postlist.DB     319252 termlist.DB
    16 position.baseA       4 record.baseA         4 value.baseA
    16 position.baseB       4 record.baseB         4 value.baseB
997604 position.DB     126632 record.DB            0 value.DB
     8 postlist.baseA       8 termlist.baseA
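
For comparison, the three totals can be related to the size of the
uncompressed document set. The Python sketch below assumes that `ls -s`
reports 1 KB blocks here, which is consistent with the per-table sizes
printed by quartzcompact but is still an assumption.

```python
UNCOMPRESSED_KB = 1161119  # uncompressed document data, from the stats above

# "total" lines of the three ls -s listings, assumed to be in KB.
totals_kb = {
    "quartz": 2477756,
    "quartz compacted": 1653744,
    "flint": 1967908,
}

ratios = {name: kb / UNCOMPRESSED_KB for name, kb in totals_kb.items()}

for name, ratio in ratios.items():
    print("%-17s %.2f x uncompressed document data" % (name, ratio))
```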


