[Xapian-discuss] xapian's cache

Fri Nov 23 23:32:58 GMT 2007

On Fri, Nov 23, 2007 at 02:25:31PM -0800, Andrey wrote:

> About the "warming-up" of xapian from the first few queries, at which
> level does it cache the data?
> xapian / xapian-binding / filesystem IO?

Right now Xapian does (effectively) no explicit caching; it lets the
operating system cache whatever it likes. This makes it difficult to
answer most of your questions without knowing exactly what your
operating system is (and details of how it caches). However in
general, assuming there is enough core (physical memory) for the
processes to never go into swap, the remaining memory will be used to
cache blocks from the filesystem. From now on when I say 'cache' I
mean 'operating system filesystem cache'.
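
Since the cache in question is the OS's, one common way to get past
the "warm-up" phase is simply to read the database files once before
serving searches, so that their blocks are already in the filesystem
cache. A minimal sketch, assuming the database lives in a single
directory and that there is enough spare memory to hold it;
warm_cache and its parameters are hypothetical names, not part of
Xapian, and whether the blocks then stay cached is up to your OS:

```python
import os

def warm_cache(db_dir, chunk_size=1 << 20):
    """Read every file under db_dir once, discarding the data.

    The only purpose is the side effect: the operating system gets a
    chance to pull the files' blocks into its filesystem cache.
    Returns the total number of bytes read.
    """
    total = 0
    for root, _dirs, files in os.walk(db_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    total += len(chunk)
    return total
```

Comparing search times with and without this step is a cheap way to
see how much of the warm-up really is the filesystem cache.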

[I should point out up front that I can't remember any of the deep
details of how flint btrees are likely to map onto disk blocks, and so
some of this may need to be elaborated on or corrected by Olly or
Richard.]

When a Xapian writer flushes the database to disk, a number of file
system blocks will change. How many cached blocks in the reader's
operating system become invalidated at that point will depend on
details of your database and on your indexing and search profiles;
you're best off measuring the effects of various changes here.
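
One rough way to measure this is to take a copy of a database file
before and after a flush and count how many blocks differ. A minimal
sketch, assuming you can copy the file while nothing is writing to it;
count_changed_blocks is a hypothetical helper, and the 8192-byte
default is an assumption that you should replace with whatever block
size your database actually uses. It gives an indication of churn, not
an exact count of cache invalidations:

```python
def count_changed_blocks(old_path, new_path, block_size=8192):
    """Compare two snapshots of a file block by block.

    Returns (changed, total), where total is the number of blocks in
    the longer file and changed counts blocks that differ (blocks
    present in only one snapshot count as changed).
    """
    changed = 0
    total = 0
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        while True:
            a = old.read(block_size)
            b = new.read(block_size)
            if not a and not b:
                break
            total += 1
            if a != b:
                changed += 1
    return changed, total
```

Running this across several flushes with different flush intervals
should show whether changing the interval makes a real difference to
how much of the file is rewritten.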

If you have a writer using local disk and exporting to a remote reader
(presumably using NFS), you are using the memory in the writer for two
distinct things: caching blocks read from disk for the revision being
used by the reader (so that requests from the reader that aren't in
the reader's cache already will incur only the network overhead, not a
hit to disk on the writer as well) and caching blocks on their way to
disk for the revision being assembled by the writer. (It's a little
more complex than that because of the way revisions work, but
hopefully that's a helpful view.)

In very high performance situations, you /may/ get better mileage out
of having the storage local to the reader, not the writer (throw lots
of memory at the reader), or in a different box altogether (throw lots
of memory at both reader and backend storage). However there may also
be advantages to having the storage local to the writer (see below).

Note that if your continual indexing process is 'sane' (by which I
mean it's nowhere near intensive enough to risk getting behind - ie
it's mostly sleeping, not actually doing work) then the memory in the
writer isn't so important (but if the writer is also the final storage
machine, the memory for that is important).

> What happens to the cache when the DB is flushed? Will the cache in
> memory be gone, or will it be incrementally added to?

That depends on lots of things. Whatever has the storage local to it
will do a pretty good job of throwing away invalidated cache blocks
and, where necessary, reloading the freshened blocks from disk. (If
the writer is on the same operating system instance, those freshened
blocks are likely to already be in cache because of write-behind, in
which case: win! Nothing has to hit disk to get them into core,
assuming you have enough memory.)

If the reader doesn't have local storage, it will have its own (now
invalid) blocks cached. A good NFS implementation will deal with this
fairly efficiently (NFSv4 more so than NFSv3, with the caveat that
some NFSv4 implementations seem less stable in all sorts of nasty edge
cases; however when you're pushing stuff that hard you're always going
to have to do more work, so I'd ignore that for the time being). It'll
need to go back across the network to freshen the block (assuming it
needs that block again) or to fetch a new one (if that block is no
longer used). The latter is a minor pain: a block that is no longer
used but hasn't changed might never be invalidated, although you can
probably trust your OS to do the sensible thing here and just throw it
away eventually in favour of blocks that are still being used. With
luck you'll have enough memory on your storage box that the majority
of these (ie the most common blocks, those needed for the most common
searches) will be in core, so you won't actually hit disk there.

(Some NFS implementations allow you to cache on disk, either by an
extension layer above NFS or built into the file system implementation
itself. The same kind of thing applies there, except that you might
get better speed than having to do a network hit, depending on the
relative speed of network vs local disk, and your disk loading.)

It would be nice to be able to point a monitoring system at a running
OS and figure out what's going on in its cache usage. You can get this
kind of data to an extent on some systems, with the caveats that (a)
it will take up memory, and so slow things down if you're running
short on core, and (b) it will take up processor time. However, given
a bit of time (and perhaps the risk that sometimes your system will
respond much more slowly than it should as you work out the right
tuning parameters), you can do it externally by measuring what you
care about
and tuning to improve that measurement. (This has the added advantage
that you don't need to know intimately how your OS caches work.)

The big message is: measure it, change it a bit, measure it
again. Empirical data coming out of realistic simulated (or actual
real live) searches and indexing using your code is the only real way
to know that you're improving things.
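
As a starting point for that kind of measurement, here is a minimal
sketch of a timing harness. It assumes you supply run_query, a
function of your own that performs one search using your real code;
time_queries and summarise are hypothetical names. The first pass
will usually include the cache warm-up, so it's worth looking at the
passes separately rather than averaging them together:

```python
import statistics
import time

def time_queries(run_query, queries, repeats=3):
    """Time each query on each of several passes.

    Returns a list with one entry per pass; each entry is a list of
    elapsed times in seconds, in the same order as queries.
    """
    timings = []
    for _ in range(repeats):
        pass_timings = []
        for q in queries:
            start = time.perf_counter()
            run_query(q)
            pass_timings.append(time.perf_counter() - start)
        timings.append(pass_timings)
    return timings

def summarise(pass_timings):
    """Reduce one pass to the figures most worth tracking."""
    ordered = sorted(pass_timings)
    return {"median": statistics.median(ordered), "max": ordered[-1]}
```

Run it with a realistic set of queries before and after each change
(flush interval, memory, where the storage lives) and compare the
summaries for matching passes.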

> notice that the DB keeps flushing every 10,000 docs (@5mins); will the search
> performance be better off if separated into 2 DBs, and searched over like
> this? will the cache of db1 stay and benefit?
> db1 < very large
> db2 < only todays document, flush every 5mins 10,000 doc

Possibly, but not necessarily for caching reasons. I *think* (Olly or
Richard should jump in here) that provided your underlying filesystem
block size is the same as the btree block size, you won't see a
huge amount of difference in terms of caching efficiency. You should
get other benefits, particularly around inserting into db2 (because
its btree isn't nearly as big).
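
To check the block size condition on a Unix-like system, you can ask
the filesystem what block size it reports and compare it with the
block size your database uses. A minimal sketch: compare_block_sizes
is a hypothetical helper, the 8192-byte default is an assumption that
you should check against your own database, and the figure the OS
reports (f_bsize) is its preferred I/O block size, which may not be
exactly the unit its cache works in:

```python
import os

def filesystem_block_size(path):
    """Return the block size the filesystem holding path reports."""
    return os.statvfs(path).f_bsize

def compare_block_sizes(path, btree_block_size=8192):
    """Report the filesystem and btree block sizes side by side."""
    fs_block_size = filesystem_block_size(path)
    return {
        "filesystem": fs_block_size,
        "btree": btree_block_size,
        "same": fs_block_size == btree_block_size,
    }
```

Call compare_block_sizes with the directory holding each database; if
the sizes differ, the caching behaviour is one more thing to measure
rather than assume.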

Finally, note that there are many other routes you can take. Without
knowing anything about what scale you're trying to achieve, what your
budget is, and so on, no one's going to be able to give you a set of
instructions on how to build the best system for your needs. (And even
if someone could, they'd probably want to charge you a consulting fee
for it ;-)

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org