[Xapian-discuss] Compressed Btrees

Olly Betts olly at survex.com
Sun Dec 12 19:55:32 GMT 2004


On Sat, Dec 11, 2004 at 10:53:39AM +0100, Arjen van der Meijden wrote:
> But it only reports those statistics on record_ and value_.

Hmm, it looks like the stat() call used to read the file size isn't
working for files over 2G.  I'll see if I can work out why.
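
My first guess (unverified) is that it's a large file support problem:
with a 32-bit off_t, stat() fails with EOVERFLOW on files of 2GB or
more.  Building with 64-bit file offsets should cope, along these
lines:

/* Hypothetical standalone check, not Xapian code.  The define must
   come before any system header is pulled in. */
#define _FILE_OFFSET_BITS 64
#include <sys/stat.h>
#include <cstdio>

int main(int argc, char **argv) {
    struct stat st;
    if (argc > 1 && stat(argv[1], &st) == 0) {
        // With 64-bit offsets, st_size can report sizes >= 2GB.
        std::printf("%lld bytes\n", (long long)st.st_size);
        return 0;
    }
    std::perror("stat");
    return 1;
}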

> The differences in size are rather marginal. But the most compact 
> results would be achieved by:
> Record:   filtered

I'm a bit surprised by this, though the difference is pretty small (<1%).
It looks like your records are small, so perhaps that's why.

> However, it may be more efficient to just not compress the position-db,
> since there seems to be only a small gain for the extra CPU power;
> rounded, all four are 6.3G in size.

Interpolative coding will work much better for the positions anyway.
I've written code to calculate the compression achievable for a given
position list, though I've not written the actual compression and
decompression code yet.  I'll try to sort something out so you can at
least see how well they'll compress.
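
Roughly, the calculation goes like this (a sketch of the idea rather
than the exact code; it charges a plain fixed-width binary code per
value, which is slightly pessimistic compared to a centred minimal
binary code):

// Sketch only: estimate the bits binary interpolative coding needs
// for a sorted, duplicate-free position list lying within [lo, hi].
#include <cstdio>

// Bits for one value drawn from a range of r possibilities, using a
// plain fixed-width binary code.
static unsigned width(long r) {
    unsigned b = 0;
    for (long v = r - 1; v > 0; v >>= 1) ++b;
    return b;
}

// Encode the middle element of p[first..last] (which is known to lie
// in a narrowed range), then recurse on each half.
static unsigned long interp_bits(const long *p, int first, int last,
                                 long lo, long hi) {
    if (first > last) return 0;
    int mid = first + (last - first) / 2;
    // mid-first elements sit below p[mid] and last-mid above it, so:
    long min_val = lo + (mid - first);
    long max_val = hi - (last - mid);
    unsigned long bits = width(max_val - min_val + 1);
    bits += interp_bits(p, first, mid - 1, lo, p[mid] - 1);
    bits += interp_bits(p, mid + 1, last, p[mid] + 1, hi);
    return bits;
}

int main() {
    const long pos[] = {3, 9, 10, 11, 42, 70};
    // Assume positions run from 1 up to the document length (100 here).
    std::printf("%lu bits\n", interp_bits(pos, 0, 5, 1, 100));
    return 0;
}

The clustering you typically get in position lists is what makes this
win: when the range for an element collapses, it costs very few bits,
or even none at all.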

> I didn't test with dictionaries and stuff, since I don't fully 
> understand how I can fetch and create a good dictionary.

It's just a text file containing something which looks kind of like most
of your records do.  The name "dictionary" is rather misleading really,
but that's the terminology zlib uses.  It's just some typical text
that's effectively run through the compressor first, and the output
thrown away.  Just picking the data from a record at random will
probably work well enough.  I came across a paper which suggested that
calculating the optimal dictionary was hard (some form of NP-hardness,
IIRC, but I don't have the reference to hand).
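
In case it helps, on the zlib side it's just
deflateSetDictionary()/inflateSetDictionary().  A rough sketch of the
compression half (error handling mostly trimmed; record and dict are
whatever you feed in):

#include <cstring>
#include <zlib.h>

// Compress one record using a preset dictionary; returns the
// compressed size, or -1 on error.
static int compress_with_dict(const char *record, unsigned int len,
                              const char *dict, unsigned int dict_len,
                              unsigned char *out, unsigned int out_size) {
    z_stream z;
    std::memset(&z, 0, sizeof z);
    if (deflateInit(&z, Z_DEFAULT_COMPRESSION) != Z_OK) return -1;
    // The dictionary text is effectively run through the compressor
    // first with its output thrown away, so the record's data can
    // match against it.
    deflateSetDictionary(&z, (const Bytef *)dict, dict_len);
    z.next_in = (Bytef *)record;
    z.avail_in = len;
    z.next_out = out;
    z.avail_out = out_size;
    int ret = deflate(&z, Z_FINISH);
    int compressed = (ret == Z_STREAM_END) ? (int)z.total_out : -1;
    deflateEnd(&z);
    return compressed;
}

The decompressor has to supply exactly the same bytes: inflate() stops
with Z_NEED_DICT, at which point you call inflateSetDictionary() and
carry on.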

> (If you'd like to experiment with that yourself, contact me off-list,
> Olly)

I don't have the time right now.

Cheers,
    Olly
