[Xapian-discuss] Get the FLINT_BTREE_MAX_KEY_LEN variable in Python
Olly Betts
olly at survex.com
Tue Nov 28 03:20:13 GMT 2006
On Sat, Nov 25, 2006 at 02:24:36PM -0200, Rafael SDM Sierra wrote:
> How can I get this variable?
It's actually a constant, not a variable and its value isn't currently
available via the API (the constant itself isn't actually directly
useful, but a "maximum term length" would be). This is on my list of
things to sort out, but it's complicated by zero bytes in terms being
treated specially in this area. My plan is to fix that and then we can
have a "max term length" constant or API call.
Here's a bit of background: The btree manager which Quartz and Flint
both use versions of has a maximum key length of 252 bytes. But because
the keys contain more than just term names, the maximum safe length for
a term is 240 bytes (or perhaps a few more, but 240 is certainly safe).
There's one further wrinkle - any zero bytes in a term require 2 bytes
in the Quartz key.
Another oddity is that the key for some of the Btree tables has the
document id encoded using a variable length coding, so bigger document
ids need more bytes. So the absolute maximum term length varies by
document by a byte or two!
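Until a "max term length" is exposed, the background above can be turned into a rough pre-check on the Python side. This is only a sketch under stated assumptions: the names SAFE_TERM_BYTES, encoded_term_length and is_safe_term are made up for illustration and are not part of the Xapian API, and the check deliberately uses the conservative 240 byte figure rather than trying to model the variable-length document id.

```python
# Hypothetical helper, not part of Xapian: a conservative pre-check
# based on the figures given in this message.
SAFE_TERM_BYTES = 240  # "certainly safe" term length; the btree key limit itself is 252


def encoded_term_length(term: bytes) -> int:
    """Bytes the term needs in the key, counting each zero byte as 2 bytes."""
    return len(term) + term.count(b"\x00")


def is_safe_term(term: bytes) -> bool:
    """True if the term fits within the conservative 240 byte limit."""
    return encoded_term_length(term) <= SAFE_TERM_BYTES
```

Pass the term as bytes (for example word.encode("utf-8")), since the limit is on bytes rather than characters.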
Currently I recommend imposing a sane threshold when tokenising text
to produce terms, which is also wise as otherwise things like uuencoded
text and ASCII art can generate lots of useless junk terms which just
bloat the database! Omega uses a threshold of 64 for this.
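A length threshold like this is easy to apply while tokenising in Python. The following is a minimal sketch, not Omega's tokeniser: the function name tokenise, the simple \w+ word pattern, lowercasing, and measuring the length in UTF-8 bytes are all assumptions made for illustration.

```python
import re

MAX_WORD_BYTES = 64  # the threshold Omega is described as using above


def tokenise(text, max_bytes=MAX_WORD_BYTES):
    """Yield lowercased words, skipping any whose UTF-8 form exceeds max_bytes."""
    for word in re.findall(r"\w+", text):
        word = word.lower()
        if len(word.encode("utf-8")) > max_bytes:
            # Overlong "words" are usually junk (uuencoded data, ASCII art).
            continue
        yield word
```

Each word that survives the filter can then be added as a posting in the usual way.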
Also ensure that boolean terms can't be too long - for these, 240
bytes is a safe limit. If you need to make a term from something
which can be longer like a URL, you might want to look at how Omega
handles this by hashing the tail of long URLs. The code is in
omindex.cc, function make_url_term.
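To give a flavour of the idea in Python, here is a sketch of hashing the tail of an overlong URL so the resulting term stays within 240 bytes. It is not a port of make_url_term: the function name, the "U" prefix default, the choice of SHA-1, and the way the split point is chosen are all assumptions for illustration, and the terms it produces will not match the ones Omega generates.

```python
import hashlib

MAX_TERM_BYTES = 240  # safe term length from the discussion above


def make_url_term_sketch(url, prefix="U", max_bytes=MAX_TERM_BYTES):
    """Return a term for url, replacing an overlong tail with a hash of it."""
    term = prefix.encode("utf-8") + url.encode("utf-8")
    if len(term) <= max_bytes:
        return term
    digest_len = hashlib.sha1().digest_size * 2  # 40 hex characters
    keep = max_bytes - digest_len
    # Keep the head as-is and hash everything after it, so the result is
    # exactly max_bytes long and still distinct for different tails.
    tail_hash = hashlib.sha1(term[keep:]).hexdigest().encode("ascii")
    return term[:keep] + tail_hash
```

Short URLs pass through unchanged, so only the rare overlong ones lose readability.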
> instead when I add the posting, so, to avoid these kind of error, I
> can do some "if len(word) > max_length_allowed: continue" and go
> ahead, I think that this value is variable in some machines, here is
> 987 bytes.
987 is the length of the key generated from a term you tried to add,
not the threshold for key length you exceeded (which is 252 in both
Flint and Quartz). This limit isn't variable, though the effective
limit on term length does vary slightly, as noted above.
Cheers,
Olly