[Xapian-discuss] Compressed Btrees

Olly Betts olly at survex.com
Thu Dec 9 01:05:23 GMT 2004


I've been working on allowing Btree tags to be compressed using zlib,
and it's now working sufficiently well that some external testing would
be useful (a patch to xapian-core 0.8.4.0 is attached).

What I've done is to find a bit in the Btree key header which is always
0 at present.  Now it is set to 1 if the tag value has been compressed
with zlib.  So the patched code can read existing databases (but databases
containing compressed tags won't be readable by the unpatched code).
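
To picture the encoding: the length field at the start of each item is 2
bytes big-endian, and real lengths never need the top bit, so the patch
claims it as the flag.  A rough Python sketch (names invented):

    def pack_length(length, compressed):
        assert length < 0x8000           # real lengths fit in 15 bits
        if compressed:
            length |= 0x8000             # top bit = "tag is compressed"
        return length.to_bytes(2, 'big')

    def unpack_length(field):
        value = int.from_bytes(field, 'big')
        return value & 0x7fff, bool(value & 0x8000)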

I originally intended this to be used for compressing the document data
(which is stored in the record Btree).  But my experiments so far suggest
it's also worth compressing at least the termlist.

So that you can experiment, the patch looks for two files per Btree to decide
whether to compress and how to compress.  For the record Btree (which
currently uses record_DB, record_baseA, and record_baseB) these files
are record_compress and record_compress_strategy.

If record_compress exists, then quartz will use zlib to try to compress
any tag added to the record table which is more than 4 bytes long.

If record_compress isn't empty, its contents are used as the dictionary to
seed compression.  So for the record table, putting a typical record in the
file will improve the compression achieved.

If record_compress_strategy exists, the first byte is read to determine
the compression strategy which zlib will be told to use:

* "F" or "f" : Z_FILTERED
* "H" or "h" : Z_HUFFMAN_ONLY

Otherwise Z_DEFAULT_STRATEGY is used.

The zlib documentation explains these in more detail, but Z_FILTERED tailors
the compression to rely less on string matching, which better suits streams
of small numbers - in my tests Z_FILTERED is the best fit for the postlist
table.

Z_HUFFMAN_ONLY does Huffman encoding alone, with no string matching.  Newer
versions of zlib also support Z_RLE for run-length encoding, but that's not
well suited to the data in quartz (at least not with the current formats), so
it is commented out in the patch.
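
To get a feel for what the strategies do outside quartz, here's a toy
comparison (a sketch using Python's zlib module with made-up data - it passes
the same knobs the patch passes to deflateInit2(), but real quartz tables
will behave differently):

    import zlib

    # Made-up "postlist-like" data: a stream of small binary numbers.
    data = b''.join((i % 251).to_bytes(2, 'big') for i in range(20000))

    for name, strategy in [('default ', zlib.Z_DEFAULT_STRATEGY),
                           ('filtered', zlib.Z_FILTERED),
                           ('huffman ', zlib.Z_HUFFMAN_ONLY)]:
        # wbits -15 = raw deflate, 32K window; memLevel 9, as in the patch.
        c = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED,
                             -15, 9, strategy)
        out = c.compress(data) + c.flush()
        print(name, len(out), 'bytes from', len(data))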

The easiest and quickest way to see how well the compression works is to
create a directory with suitable *_compress and *_compress_strategy files
and use quartzcompact to copy a typical quartz database into this directory.
This will neatly (re)compress it to use your chosen settings!  For a fair
size comparison, you should also run the database through quartzcompact
with no zlib compression enabled.
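
For example, to set up a directory for the hybrid scheme from the table below
(a sketch - the paths and the typical-record file are invented; the file
naming follows the record table example above):

    import os

    os.mkdir('newdb')
    # An empty *_compress file enables compression without a dictionary.
    open('newdb/termlist_compress', 'wb').close()
    open('newdb/postlist_compress', 'wb').close()
    # A non-empty one also seeds the dictionary - use a typical record.
    with open('typical-record', 'rb') as src, \
         open('newdb/record_compress', 'wb') as dst:
        dst.write(src.read())
    # First byte 'F' selects Z_FILTERED for the postlist table.
    with open('newdb/postlist_compress_strategy', 'w') as f:
        f.write('F')

Then run quartzcompact with your existing database as the source and newdb
as the destination.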

Here are tests on an old Muscat 3.6 database containing 266078 web pages
(mostly used because it's the largest I had to hand).  This was done before
I had the dictionary support working, so no dictionaries are used.  Note that
this database has no positional information and no values.

The percentage in the third column is probably of most interest (it's relative
to the quartz format used by Xapian 0.7.0 onwards, which improved termlist
compression compared to earlier versions).  The first column compares with
Xapian 0.6.x and the middle column with the proprietary Muscat 3.6 DA format
(actually the more compact "new" DA variant).

vs q    vs m36  vs q    tot size
pre0.7          0.7+    (Kbytes)

 61.7%  100.0%  124.3%    832672  m36 - Muscat 3.6 DA
100.0%  162.1%  201.5%   1349540  q - quartz (pre 0.7)
 52.4%   85.0%  105.7%    707828  qc - q after quartzcompact
105.3%  170.6%  212.1%   1420812  q-post0.7 - better termlist compression
 49.6%   80.4%  100.0%    669852  q-post0.7-compact - after quartzcompact
 36.4%   59.1%   73.4%    491828  qc-zdefault - all tables compressed with
                                  Z_DEFAULT_STRATEGY
 36.8%   59.7%   74.2%    496732  qc-zfiltered - all compressed with Z_FILTERED
 38.1%   61.7%   76.7%    513804  qc-zhuffman - all compressed with Z_HUFFMAN_ONLY
 36.3%   58.9%   73.2%    490244  qc-zhybrid - Z_FILTERED for postlist only,
                                  Z_DEFAULT_STRATEGY for other tables

I think it's neat that we can now make databases 41% smaller than Muscat 3.6
DA (which was considered very compact at the time) despite including more
information in them.

But more importantly, the hybrid scheme gives a saving of nearly 27% over
current quartz.  This is with compressed postlists - you probably wouldn't
want to build a large index with compressed postlists, as every flush
would result in a lot of uncompressing and recompressing (although once
you're I/O bound, that might still be a win).

But once the database is built, it might be worthwhile compressing the
postlists as well as running it through quartzcompact to produce a
smaller, faster database optimised for searching.

And as I said, this is without using dictionaries, which can help a lot for
the record table - stick a typical record in as the dictionary, and zlib can
even compress a record as short as 5 bytes!
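
A quick way to see the dictionary effect (a sketch using Python's zlib
module, which exposes the same deflateSetDictionary() mechanism via zdict;
the record and dictionary here are invented):

    import zlib

    dictionary = b'sample=Some typical record text url=http://example.org/'
    record = b'url=http://example.org/'

    for zdict in (None, dictionary):
        kwargs = {'zdict': zdict} if zdict else {}
        c = zlib.compressobj(9, zlib.DEFLATED, -15, 9,
                             zlib.Z_DEFAULT_STRATEGY, **kwargs)
        out = c.compress(record) + c.flush()
        print('with dict' if zdict else 'no dict ', len(out), 'bytes')

With the dictionary the whole record becomes a single back-reference into
the dictionary; without it, a string this short has nothing to match against
and can't shrink.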

If you've got any large databases, I'd be very interested to see what results
you can get, and also speed tests for building and searching with compression
on.

I think the patch wouldn't build with zlib 1.1.4, although perhaps that was
only true of an earlier version of the patch.  I'm currently using zlib 1.2.1
anyhow.

Cheers,
    Olly
-------------- next part --------------
Index: Makefile.am
===================================================================
RCS file: /usr/data/cvs/xapian/xapian-core/Makefile.am,v
retrieving revision 1.138
diff -p -u -r1.138 Makefile.am
--- Makefile.am	2 Nov 2004 04:46:45 -0000	1.138
+++ Makefile.am	8 Dec 2004 19:41:03 -0000
@@ -48,4 +48,5 @@ libxapian_la_LDFLAGS = @XAPIAN_LDFLAGS@ 
 #libxapian_la_LDFLAGS = -export-symbols-regex 'Xapian'
 
 libxapian_la_LIBADD = common/libcommon.la backends/libbackend.la \
- matcher/libmatcher.la languages/liblanguages.la api/libapi.la $(LIBNET_LA)
+ matcher/libmatcher.la languages/liblanguages.la api/libapi.la $(LIBNET_LA) \
+ -L/u1/olly/install/lib -lz
Index: backends/quartz/btree.cc
===================================================================
RCS file: /usr/data/cvs/xapian/xapian-core/backends/quartz/btree.cc,v
retrieving revision 1.180
diff -p -u -r1.180 btree.cc
--- backends/quartz/btree.cc	30 Nov 2004 02:37:19 -0000	1.180
+++ backends/quartz/btree.cc	8 Dec 2004 19:43:38 -0000
@@ -77,6 +77,13 @@ PWRITE_PROTOTYPE
 # include <io.h> // for _commit()
 #endif
 
+#ifdef COMPRESSED_TAGS
+#include <zlib.h>
+// Only try to compress tags longer than this many bytes.
+const size_t COMPRESS_MIN = 4;
+#define DONT_COMPRESS -1
+#endif /* COMPRESSED_TAGS */
+
 // Only useful for platforms like Windows which distinguish between text and
 // binary files.
 #ifndef __WIN32__
@@ -1089,6 +1096,62 @@ Btree::add(const string &key, string tag
 
     form_key(key);
 
+#ifdef COMPRESSED_TAGS
+    bool compressed = false;
+    if (compress_strategy != DONT_COMPRESS && tag.size() > COMPRESS_MIN) {
+	z_stream stream;
+	int err;
+
+	stream.next_in = (Bytef *)(tag.data());
+	stream.avail_in = (uInt)tag.size();
+
+	// If compressed size is >= tag.size(), we don't want to compress.
+	unsigned long blk_len = tag.size() - 1;
+	unsigned char * blk = new unsigned char[blk_len];
+	stream.next_out = blk;
+	stream.avail_out = (uInt)blk_len;
+
+	stream.zalloc = reinterpret_cast<alloc_func>(0);
+	stream.zfree = reinterpret_cast<free_func>(0);
+	stream.opaque = (voidpf)0;
+
+	//cout << "deflateInit about to be called\n" << endl;
+	// -15 means raw deflate with 32K LZ77 window (largest)
+	// memLevel 9 is the highest (8 is default)
+	err = deflateInit2(&stream, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -15, 9,
+			   compress_strategy);
+	//cout << "deflateInit returns " << Z_OK << endl;
+	if (err != Z_OK) abort();
+	if (!compress_dict.empty()) {
+	    err = deflateSetDictionary(&stream,
+		    reinterpret_cast<const Bytef *>(compress_dict.data()),
+		    compress_dict.size());
+	    //cout << "deflateSetDictionary returns " << Z_OK << endl;
+	    if (err != Z_OK) abort();
+	}
+
+	err = deflate(&stream, Z_FINISH);
+	if (err == Z_STREAM_END) {
+	    // If deflate succeeded, then the output was at least one byte
+	    // smaller than the input.
+	    //cout << "deflate compressed "<<tag.size()<<"["<<tag<<"] to ";
+	    tag.assign(reinterpret_cast<const char *>(blk), stream.total_out);
+	    //cout << tag.size()<<"["<<tag<<"]"<<endl;
+	    compressed = true;
+	    err = deflateEnd(&stream);
+	    if (err != Z_OK) {
+		cout << "deflateEnd failed, err = " << err << endl;
+		abort();
+	    }
+	} else {
+	    // Deflate failed - presumably the data wasn't compressible.
+	    (void)deflateEnd(&stream);
+	}
+
+	delete [] blk;
+    }
+#endif /* COMPRESSED_TAGS */
+
     // sort of matching kt.append_chunk(), but setting the chunk
     size_t cd = kt.key().length() + K1 + I2 + C2 + C2;  // offset to the tag data
     size_t L = max_item_size - cd;	 // largest amount of tag data for any chunk
@@ -1119,7 +1182,11 @@ Btree::add(const string &key, string tag
 	size_t l = (i == m ? residue : (i == 1 ? first_L : L));
 	Assert(cd + l <= block_size);
 	Assert(string::size_type(o + l) <= tag.length());
+#ifndef COMPRESSED_TAGS
 	kt.set_tag(cd, tag.data() + o, l);
+#else /* COMPRESSED_TAGS */
+	kt.set_tag(cd, tag.data() + o, l, compressed);
+#endif /* COMPRESSED_TAGS */
 	kt.set_component_of(i);
 	
 	o += l;
@@ -1213,6 +1280,9 @@ Btree::read_tag(Cursor * C_, string *tag
     if (n > 1) tag->reserve(max_item_size * n);
 
     item.append_chunk(tag);
+#ifdef COMPRESSED_TAGS
+    bool compressed = item.get_compressed();
+#endif /* COMPRESSED_TAGS */
 
     for (int i = 2; i <= n; i++) {
 	next(C_, 0);
@@ -1220,6 +1290,105 @@ Btree::read_tag(Cursor * C_, string *tag
     }
     // At this point the cursor is on the last item - calling next will move
     // it to the next key (Bcursor::get_tag() relies on this).
+#ifdef COMPRESSED_TAGS
+    if (!compressed) return;
+
+    // FIXME: Perhaps we should we decompress each chunk as we read it so we
+    // don't need both the full compressed and uncompressed tags in memory
+    // at once.  But at least this makes for a cleaner patch while this is
+    // still somewhat experimental.
+    z_stream stream;
+    int err;
+
+    string utag;
+    // May not be enough for a compressed tag, but it's a reasonable guess.
+    utag.reserve(tag->size() + tag->size() / 2);
+
+    Bytef buf[8192];
+
+    //cout << "inflating" << endl;
+    if (!compress_dict.empty()) {
+	Bytef header[6] = { 0x78, 0x20, 0, 0, 0, 0};
+	uLong adler = adler32(0L, Z_NULL, 0);
+	adler = adler32(adler,
+			reinterpret_cast<const Bytef *>(compress_dict.data()),
+			compress_dict.size());
+	set_int4(header, 2, adler);
+
+	stream.next_in = header;
+	stream.avail_in = 6;
+    } else {
+	stream.next_in = Z_NULL;
+	stream.avail_in = 0;
+    }
+
+    /* Check for source > 64K on 16-bit machine: */
+//	if ((uLong)stream.avail_in != tag->size()) abort();
+
+    stream.next_out = buf;
+    stream.avail_out = (uInt)sizeof(buf);
+//	if ((uLong)stream.avail_out != sizeof(buf)) abort();
+
+    stream.zalloc = reinterpret_cast<alloc_func>(0);
+    stream.zfree = reinterpret_cast<free_func>(0);
+
+    //cout << "inflateInit" << endl;
+    err = inflateInit(&stream);
+    if (err != Z_OK) {
+	cout << "inflateInit err = " << err << endl;
+	abort();
+    }
+
+    if (!compress_dict.empty()) {
+	err = inflate(&stream, Z_SYNC_FLUSH);
+	if (err != Z_NEED_DICT) {
+	    cout << "don't need dict?!" << endl;
+	    abort();
+	}
+	err = inflateSetDictionary(&stream,
+		reinterpret_cast<const Bytef *>(compress_dict.data()),
+		compress_dict.size());
+	if (err != Z_OK) {
+	    cout << "iSD failed with err = " << err << endl;
+	}
+    }
+
+    stream.next_in = (Bytef*)tag->data();
+    stream.avail_in = (uInt)tag->size();
+
+    while (err != Z_STREAM_END) {
+	stream.next_out = buf;
+	stream.avail_out = (uInt)sizeof(buf);
+	err = inflate(&stream, Z_SYNC_FLUSH);
+	if (err == Z_BUF_ERROR && stream.avail_in == 0) {
+	    cout << "Z_BUF_ERROR - faking checksum of " << stream.adler << endl;
+	    Bytef header[4];
+	    set_int4(header, 0, stream.adler);
+	    stream.next_in = header;
+	    stream.avail_in = 4;
+	    err = inflate(&stream, Z_SYNC_FLUSH);
+	    if (err == Z_STREAM_END) break;
+	}
+
+	if (err != Z_OK && err != Z_STREAM_END) {
+	    cout << "inflate err = " << err << endl;
+	    abort();
+	}
+
+	utag.append(reinterpret_cast<const char *>(buf),
+		    stream.next_out - buf);
+    }
+    Assert(utag.size() == stream.total_out);
+    if (utag.size() != stream.total_out) {
+	cout << "utag.size() != stream.total_out (" << utag.size() << "!=" << stream.total_out << ")" << endl;
+	abort();
+    }
+
+    err = inflateEnd(&stream);
+    if (err != Z_OK) abort();
+
+    swap(*tag, utag);
+#endif /* COMPRESSED_TAGS */
 }
 
 void
@@ -1348,6 +1517,38 @@ Btree::basic_open(bool revision_supplied
     max_item_size = (block_size - DIR_START - BLOCK_CAPACITY * D2) / BLOCK_CAPACITY;
 
     base_letter = ch;
+
+#ifdef COMPRESSED_TAGS
+    // If foo_compress exists then try to compress tags in foo_DB with zlib
+    // using the contents of foo_compress as a dictionary.
+    int fd = ::open((name + "compress").c_str(), O_RDONLY | O_BINARY);
+    compress_strategy = DONT_COMPRESS;
+    if (fd >= 0) {
+	compress_strategy = Z_DEFAULT_STRATEGY;
+	compress_dict = sys_read_all_bytes(fd, 65536);
+	::close(fd);
+    }
+    fd = ::open((name + "compress_strategy").c_str(), O_RDONLY | O_BINARY);
+// Look at the first character of foo_compress_strategy:
+//	Z_DEFAULT_STRATEGY for normal data
+// F/f:	Z_FILTERED for data produced by a filter (or predictor)
+// H/h:	Z_HUFFMAN_ONLY to force Huffman encoding only (no string match)
+//// R/r:	Z_RLE to limit match distances to one (run-length encoding)
+    CompileTimeAssert(DONT_COMPRESS != Z_DEFAULT_STRATEGY);
+    CompileTimeAssert(DONT_COMPRESS != Z_FILTERED);
+    CompileTimeAssert(DONT_COMPRESS != Z_HUFFMAN_ONLY);
+//    CompileTimeAssert(DONT_COMPRESS != Z_RLE);
+    if (fd >= 0) {
+	string s = sys_read_all_bytes(fd, 1);
+	::close(fd);
+	switch (s[0]) {
+	    case 'F': case 'f': compress_strategy = Z_FILTERED; break;
+	    case 'H': case 'h': compress_strategy = Z_HUFFMAN_ONLY; break;
+//	    case 'R': case 'r': compress_strategy = Z_RLE; break;
+	    default: compress_strategy = Z_DEFAULT_STRATEGY;
+	}
+    }
+#endif /* COMPRESSED_TAGS */
 
     /* ready to open the main file */
 
Index: backends/quartz/btree.h
===================================================================
RCS file: /usr/data/cvs/xapian/xapian-core/backends/quartz/btree.h,v
retrieving revision 1.86
diff -p -u -r1.86 btree.h
--- backends/quartz/btree.h	19 Nov 2004 04:21:11 -0000	1.86
+++ backends/quartz/btree.h	8 Dec 2004 19:41:03 -0000
@@ -24,6 +24,9 @@
 #ifndef OM_HGUARD_BTREE_H
 #define OM_HGUARD_BTREE_H
 
+// Compress tags with zlib.
+#define COMPRESSED_TAGS
+
 #include <algorithm>
 #include <string>
 using std::string;
@@ -179,7 +182,12 @@ public:
     Item_base(T p_, int c) : p(p_ + GETINT2(p_, c)) { }
     Item_base(T p_) : p(p_) { }
     T get_address() const { return p; }
+#ifdef COMPRESSED_TAGS
+    int size() const { return (GETINT2(p, 0) & 0x7fff); } /* I in diagram above */
+    bool get_compressed() const { return *p & 0x80; }
+#else /* COMPRESSED_TAGS */
     int size() const { return GETINT2(p, 0); } /* I in diagram above */
+#endif /* COMPRESSED_TAGS */
     int component_of() const {
 	return GETINT2(p, GETK(p, I2) + I2 - C2);
     }
@@ -265,9 +273,16 @@ public:
 	set_component_of(1);
     }
     // FIXME passing cd here is icky
+#ifdef COMPRESSED_TAGS
+    void set_tag(int cd, const char *start, int len, bool compressed) {
+#else /* COMPRESSED_TAGS */
     void set_tag(int cd, const char *start, int len) {
+#endif /* COMPRESSED_TAGS */
 	memmove(p + cd, start, len);
 	set_size(cd + len);
+#ifdef COMPRESSED_TAGS
+	if (compressed) *p |= 0x80;
+#endif /* COMPRESSED_TAGS */
     }
     void fake_root_item() {
 	set_key_len(K1 + C2);   // null key length
@@ -692,6 +707,14 @@ class Btree {
 	 *  when updating (in Btree::add_item().
 	 */
 	byte * split_p;
+
+#ifdef COMPRESSED_TAGS
+	/** DONT_COMPRESS, or a zlib strategy: Z_DEFAULT_STRATEGY, Z_FILTERED, Z_HUFFMAN_ONLY. */
+	int compress_strategy;
+
+	/** The dictionary (if any) to use for tag compression. */
+	string compress_dict;
+#endif /* COMPRESSED_TAGS */
 
 	/* Debugging methods */
 //	void report_block_full(int m, int n, const byte * p);

