[Xapian-discuss] Size of spelling database ?

Olly Betts olly at survex.com
Fri Nov 2 18:29:22 GMT 2007


On Fri, Nov 02, 2007 at 11:37:42PM +0800, Fabrice Colin wrote:
> I have one 192MB index here with a 51MB postlist and a 98MB spelling
> database. Another index is 597MB, with a 159MB postlist.DB and a
> 275MB spelling.DB.
> I also got a report about a 1.3GB index with a 408MB postlist.DB and a
> 622MB spelling.DB.
> 
> Should I be worried ? ;-)

It's certainly worth investigating.

1.0.3 fixed a bug which was preventing zlib compression from being used,
so the spelling table will be smaller if you're using 1.0.3 or later.

The only other thing which comes to mind is that long terms could be
bloating the spelling data.  The number of n-grams generated is
proportional to the term length, and we store the term in a list for
each n-gram, so before compression a term of length L costs roughly
L * L bytes.  We prefix-compress and then zlib-compress these lists of
terms, but the extra space required for a term is still likely to be
super-linear in the term length.
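
To put rough numbers on that, here is a minimal Python sketch of the
uncompressed cost only.  It assumes plain overlapping n-grams with n=3,
and the function names are made up for illustration; it is not the
actual spelling table code, and it ignores the prefix and zlib
compression, which will shrink the real figures.

```python
import string


def ngrams(term, n=3):
    """Return the overlapping n-grams of term (n=3 is an assumption)."""
    return [term[i:i + n] for i in range(max(len(term) - n + 1, 0))]


def raw_spelling_cost(term, n=3):
    """Bytes needed to store term once per n-gram, before compression."""
    return len(ngrams(term, n)) * len(term.encode("utf-8"))


# A hypothetical 36-character term, e.g. junk from a badly parsed file.
long_term = string.ascii_lowercase + string.digits

for term in ("index", "compression", long_term):
    # Columns: term length, number of n-grams, raw bytes stored.
    print(len(term), len(ngrams(term)), raw_spelling_cost(term))
```

In this model the raw cost is (L - 2) * L bytes for a term of length L,
so a handful of very long terms can outweigh a lot of ordinary words.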

Cheers,
    Olly


