[Xapian-discuss] Compressed Btrees

Arjen van der Meijden arjen at glas.its.tudelft.nl
Mon Dec 13 14:26:28 GMT 2004


Olly Betts wrote:
> On Mon, Dec 13, 2004 at 02:23:00PM +0100, Arjen van der Meijden wrote:
> 
>>This is on the non-compacted database (currently I don't have a 
>>compacted one):
> 
> 
> The results would be the same anyway.
> 
> 
>>entries: 293400883
>>Totals:
>>Before: 1680133099
>>After:  1189099066
>>Compressed by: 29.3%
>>Theoretical limit (assuming uniform): 1188233055
>>
>>If I understand it correctly this will be the compression on top of the 
>>compaction (which only yields 8% reduction) of the position-table ?
> 
> 
> It's not totally obvious how to translate it - this figure is just for
> the change in size of the tag values.  There's also storage for the keys
> and general overhead from the tree structure.  But if the tags are
> shorter then they'll generally be split into fewer items inside the
> Btree, which means fewer keys need to be stored.  And the less there is
> in the Btree, the less overhead there is.
> 
> So you should expect the size of position_DB to decrease by somewhat
> more than (1680133099 - 1189099066) bytes.  Is this the 6.3G
> position_DB?  If so, I'm surprised it only has 1.6G of tags.
> 
> But assuming it is, you'd expect the filesize to go down by at least
> 29.3*1.6/6.3 or around 7.5%.  It will probably be substantially better
> than that though.

Yes, it's the 6.3G (or 6.9G non-compacted) table. Does that mean the rest 
of the data is mostly structural (keys to access the tags plus Btree 
overhead)?
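
For reference, the arithmetic behind that lower bound can be reproduced 
with a few lines of Python. This is only a sketch: the 1.6G and 6.3G 
figures are the rounded sizes quoted above, the variable names are made 
up for illustration, and the result is a floor that ignores any savings 
in keys and tree overhead.

```python
# Tag sizes in bytes, copied from the "Totals" output quoted above.
tags_before = 1680133099
tags_after = 1189099066

# Absolute saving in tag bytes alone.
tag_bytes_saved = tags_before - tags_after

# Rounded sizes from the thread, in "G" (exact unit is an assumption).
tags_g = 1.6        # total size of the tags
table_g = 6.3       # compacted position table
tag_saving_pct = 29.3

# Lower bound on the file-size reduction if only the tags shrink.
min_file_saving_pct = tag_saving_pct * tags_g / table_g

# Share of the table that is not tag data (keys plus Btree overhead).
non_tag_share_pct = 100 * (table_g - tags_g) / table_g

print(tag_bytes_saved)
print(round(min_file_saving_pct, 1), round(non_tag_share_pct, 1))
```

With these rounded inputs, roughly three quarters of the table would be 
something other than tag data.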

Reading this small piece of information from the Xapian-website:
"PositionList. For each (term, document) pair, this stores the list of 
positions in the document at which the term occurs.

Key: pack_uint(did) + tname "

I'm actually not sure whether I should be surprised that the tags are so 
small. Many terms occur only once in a document and/or are relatively 
long, so it wouldn't be strange for a key (docid + term) to be longer 
than its tag (the list of positions). Or am I missing something 
important here? :)
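
To put rough numbers on that, here is a small Python sketch comparing a 
key length with the average tag size. The pack_uint below is an 
assumption: a common base-128 variable-length encoding (7 data bits per 
byte, high bit set when more bytes follow), which may differ in detail 
from what Xapian really does. The example docid and term are invented, 
and the average assumes "entries" in the output above counts tags.

```python
def pack_uint(value: int) -> bytes:
    """Assumed encoding: 7 data bits per byte, low-order group first,
    high bit set on every byte except the last."""
    if value < 0:
        raise ValueError("pack_uint needs a non-negative integer")
    out = bytearray()
    while True:
        part = value & 0x7F
        value >>= 7
        if value:
            out.append(part | 0x80)  # more bytes follow
        else:
            out.append(part)         # final byte
            return bytes(out)


def position_key(did: int, tname: str) -> bytes:
    """Key layout as quoted from the website: pack_uint(did) + tname."""
    return pack_uint(did) + tname.encode("utf-8")


# Hypothetical example: document 1000000, term "compression".
key = position_key(1_000_000, "compression")

# Average tag size before compression, from the totals quoted earlier.
avg_tag_bytes = 1680133099 / 293400883

print(len(key), round(avg_tag_bytes, 2))
```

Under these assumptions the example key is more than twice the size of 
an average tag, which would fit the picture of a table dominated by keys 
and overhead.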

Best regards,

Arjen van der Meijden


