[Xapian-discuss] Size of spelling database ?

Fabrice Colin fabrice.colin at gmail.com
Sun Nov 4 06:24:52 GMT 2007


Richard, Olly, thanks for your replies.

On 11/3/07, Olly Betts <olly at survex.com> wrote:
> On Fri, Nov 02, 2007 at 11:37:42PM +0800, Fabrice Colin wrote:
> > I have one 192MB index here with a 51MB postlist.DB and a 98MB
> > spelling.DB. Another index is 597MB, with a 159MB postlist.DB and a
> > 275MB spelling.DB.
> > I also got a report about a 1.3GB index with a 408MB postlist.DB and a
> > 622MB spelling.DB.
> >
> > Should I be worried? ;-)
>
> It's certainly worth investigating.
>
> 1.0.3 fixed a bug which was preventing zlib compression from being used,
> so the spelling table will be smaller if you're using 1.0.3 or later.
>
The figures I gave were with 1.0.4, unless I am mistaken.

> The only other thing which comes to mind is that long terms could be
> bloating up the spelling data.  The number of n-grams generated is
> proportional to the term length, and we store the term in a list for
> each n-gram.  We do then prefix-compress and then zlib-compress these
> lists of terms but the extra space required for a term is likely to be
> super-linear in the term length.
>
Do prefixed terms contribute to the spelling database too? For instance,
terms like Tmime_type, Uuri, and XDIR:/directory/name.

What should I try in order to diagnose this?
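
To make the super-linear growth Olly describes concrete, here is a rough
sketch. It assumes plain character trigrams, which is a simplification and
may not match the fragments the spelling table actually stores, and it
ignores prefix and zlib compression entirely. The function names
trigrams() and uncompressed_cost() are hypothetical, not part of Xapian.

```python
def trigrams(term):
    """Return the character trigrams of a term.

    Assumption: plain sliding-window trigrams, used here only to
    illustrate that the n-gram count grows with term length.
    """
    return [term[i:i + 3] for i in range(len(term) - 2)]


def uncompressed_cost(term):
    """Bytes used if the whole term is stored once in each trigram's list.

    This is (number of trigrams) * (term length in bytes), so it grows
    roughly quadratically with term length before any compression.
    """
    return len(trigrams(term)) * len(term.encode("utf-8"))


if __name__ == "__main__":
    # Compare a short word, an ordinary word, and a long prefixed term.
    for term in ["cat", "spelling", "XDIR:/directory/name"]:
        print(term, len(trigrams(term)), uncompressed_cost(term))
```

If long prefixed terms do end up in the spelling data, a model like this
suggests they would cost far more than ordinary words, so checking the
length distribution of the terms added for spelling seems like a
reasonable first step.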

Fabrice


