[Xapian-discuss] Compressed Btrees

Olly Betts olly at survex.com
Thu Dec 9 10:03:21 GMT 2004


On Thu, Dec 09, 2004 at 10:43:39AM +0100, Arjen van der Meijden wrote:
> On 9-12-2004 2:05, Olly Betts wrote:
> >If record_compress isn't empty, then the contents are used as a dictionary
> >to seed compression.  So for the record table, putting in a typical record
> >will improve the compression achieved.
> 
> How does one find a 'typical record' and in what format should it be 
> entered?

I've not experimented much, but what you're after is a chunk of text which is
at most 32K long and which provides a rich source of substrings which will
also occur in real records.  The commonest substrings should ideally be towards
the end, and making it too long may slow compression (but probably not
decompression, if I understand things).
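
To make that a bit more concrete, here is a rough Python sketch.  It assumes
the compression behaves like zlib's preset-dictionary feature (the 32K limit
suggests that, but treat it as an assumption), and the sample records and the
build_dictionary() helper are made up for illustration.  It collects the
distinct lines from some sample records, puts the commonest ones last, and
keeps at most the final 32K:

```python
import zlib
from collections import Counter

MAX_DICT = 32 * 1024  # assumed limit, matching the "at most 32K" advice above


def build_dictionary(samples, max_len=MAX_DICT):
    """Join the distinct lines of the sample records, commonest lines last."""
    counts = Counter(
        line for rec in samples for line in rec.split(b"\n") if line
    )
    # Sort by frequency, ascending, so the most frequent lines end up at the end.
    ordered = [line for line, _ in sorted(counts.items(), key=lambda kv: kv[1])]
    blob = b"\n".join(ordered)
    return blob[-max_len:]  # if too long, keep the tail (the commonest lines)


def compress(data, zdict=None):
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()


def decompress(blob, zdict=None):
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()


# Hypothetical sample records, not taken from any real database.
samples = [
    b"url=http://example.org/a.html\nsample=First page\ntype=text/html",
    b"url=http://example.org/b.html\nsample=Second page\ntype=text/html",
    b"url=http://example.org/c.html\nsample=Third page\ntype=text/html",
]
dictionary = build_dictionary(samples)

record = b"url=http://example.org/new.html\nsample=Another page\ntype=text/html"
plain_size = len(compress(record))
seeded_size = len(compress(record, dictionary))
```

Comparing plain_size and seeded_size on your own records is a cheap way to
judge whether a candidate dictionary is worth using.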

For omega, something like this is probably a reasonable starting point:

url=http://www.com/index.html
sample=
type=text/html

But note that once you've set this and compressed some tags, you can't
change it or they won't decompress.  You *can* compress some tags without
a dictionary, and then add one.
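
The same behaviour can be seen with plain zlib in Python.  This is only a
sketch under the assumption that the tables use zlib-style preset
dictionaries; the dictionaries and the tag below are made up.  Data compressed
with one dictionary fails to decompress with a different one (or with none),
while data compressed without a dictionary still decompresses after one has
been set:

```python
import zlib


def compress(data, zdict=None):
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()


def can_decompress(blob, expected, zdict=None):
    """Return True if blob decompresses to expected using the given dictionary."""
    try:
        d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
        return d.decompress(blob) + d.flush() == expected
    except zlib.error:
        return False


# Hypothetical dictionaries and tag, for illustration only.
old_dict = b"url=http://www.com/index.html\nsample=\ntype=text/html\n"
new_dict = b"language=en\n" + old_dict
tag = b"url=http://www.com/about.html\nsample=About us\ntype=text/html\n"

seeded = compress(tag, old_dict)  # compressed while old_dict was in use
plain = compress(tag)             # compressed before any dictionary was set

same_dict_ok = can_decompress(seeded, tag, old_dict)     # dictionary unchanged
changed_dict_ok = can_decompress(seeded, tag, new_dict)  # dictionary changed
removed_dict_ok = can_decompress(seeded, tag)            # dictionary removed
plain_then_dict_ok = can_decompress(plain, tag, new_dict)  # dictionary added later
```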

> I'll test it with our database, using your hybrid settings, perhaps 
> position_DB is another good candidate to run in filtered-mode?

Very likely.  If it's not too long a process for your databases, you can just
compress each table each of the 3 ways and mix and match the results by copying
(say) record_* from one compacted directory to another.
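
If you want to script the mixing and matching, something along these lines
should do.  It's a sketch only: the directory names are placeholders, and it
assumes that every file belonging to a table starts with the table name
followed by an underscore, as the record_* pattern suggests.  Work on copies
of the compacted directories rather than the originals.

```python
import glob
import os
import shutil


def copy_table(table, src_dir, dst_dir):
    """Copy every file named <table>_* from src_dir into dst_dir."""
    copied = []
    for path in sorted(glob.glob(os.path.join(src_dir, table + "_*"))):
        if os.path.isfile(path):
            shutil.copy2(path, dst_dir)  # overwrites any file of the same name
            copied.append(os.path.basename(path))
    return copied


# Example call with placeholder directory names:
# copy_table("record", "compacted-a", "compacted-b")
```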

Incidentally, quartzcompact now reports some statistics for the size reduction
achieved for each table.

Cheers,
    Olly


