[Xapian-discuss] Xapian Feedback

Olly Betts olly at survex.com
Wed Dec 29 06:26:53 GMT 2004


On Tue, Dec 28, 2004 at 01:02:22PM -0500, Michael Salib wrote:
> We've been integrating xapian into our code over the last few weeks and
> have run into several issues. We've been able to work around all of
> them, but I wanted to share this list with you so that you could see
> what kind of problems (admittedly atypical) users are running up
> against. The list is ordered from most important to least important. If
> you'd like me to post it to a mailing list, just tell me to which list I
> should send it.

I think other people may find this useful (or at least interesting) so
I'll reply to the xapian-discuss list and quote liberally.

> I apologize for the length of this message. We're not expecting
> answers from you for these issues; I just wanted to let you know what
> is going on.
> 
> 
> 1. xapian operations block for arbitrary amounts of time. Since our
> application is an asynchronous network server (basically -- there is a
> lot more to it), we have to run xapian from a special indexing process
> and send all indexing and query operations between the two processes.
> We'd like to move xapian into our async network process and dispense
> with the secondary indexing process, but we can't afford to block for
> long periods of time. The only operation we're really worried about is
> adding a new document to the db. What would be great is if we could add
> new docs asynchronously, asking xapian to do a bounded amount of work
> (say 10 ms) with each call to a finishIO method. I.e., something like:
> 
> d = xapian.Document()
> for pos, term in enumerate(terms):
>     d.add_posting(term, pos)
> db.add_document(d)
> doneYet = False
> while not doneYet:
>     doneYet = db.finishIO()
>     doOtherAsyncWorkHereSomehowOrOther()

It would be hard to bound the work to a specified length of time, since
when we're I/O bound it's so dependent on the disk subsystem.

But the work done by add_document should be fairly small if an automatic
flush isn't triggered.  So you could set XAPIAN_FLUSH_THRESHOLD to a
very large value (such as 2^32 - 1), and then explicitly call flush()
when you're happy for a longer operation.
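So the pattern would look something like this sketch (FakeWritableDatabase
stands in for xapian.WritableDatabase so the example runs anywhere; with the
real class the add_document()/flush() calls are the same):

```python
import os

# XAPIAN_FLUSH_THRESHOLD must be set before the database is opened;
# a very large value effectively disables automatic flushes.
os.environ["XAPIAN_FLUSH_THRESHOLD"] = str(2**32 - 1)

class FakeWritableDatabase:
    """Stand-in for xapian.WritableDatabase so the sketch is self-contained."""
    def __init__(self):
        self.pending = 0
        self.flushes = 0

    def add_document(self, doc):
        self.pending += 1

    def flush(self):
        self.pending = 0
        self.flushes += 1

def index_batch(db, docs, batch_size=100):
    # add_document() stays cheap with automatic flushing disabled;
    # flush explicitly at points where a longer pause is acceptable.
    for i, doc in enumerate(docs, 1):
        db.add_document(doc)
        if i % batch_size == 0:
            db.flush()
    db.flush()

db = FakeWritableDatabase()
index_batch(db, range(250))
```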

> We've considered running xapian code in a separate thread in the same
> process. Our main concern is that the swig glue code won't release
> Python's global interpreter lock during lengthy index operations.
> Refusal to yield the GIL is the conservative thing to do unless you can
> be certain that the document object being added won't be accessed
> outside the thread executing add_document. If the GIL is not released
> during add_document calls, then our application will still block for the
> duration of such calls, even when add_document is run from a separate
> thread. Do you have any idea whether this will happen?

I don't know enough about the SWIG internals to say.

As far as the C++ API is concerned, you're safe if you only access a
particular object from one thread.  You can safely open and work with
different database objects in different threads (and they can even refer
to the same database on disk), but you can't access an object from
multiple threads at once unless you use locks to marshal accesses.
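For example, on the Python side you could keep one database object per
thread with threading.local (open_db below is a stand-in for however you
actually open the database, e.g. xapian.Database(path)):

```python
import threading

_local = threading.local()

def open_db(path):
    # Stand-in for opening a real database, e.g. xapian.Database(path).
    return {"path": path}

def get_db(path):
    # One database object per thread, per the rule above: a given
    # object is only ever touched by the thread that created it.
    if not hasattr(_local, "db"):
        _local.db = open_db(path)
    return _local.db
```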

> 2. It is extremely difficult to ensure that a xapian db has been
> properly closed, especially in the face of errors. The problem here is
> that xapian dbs have no explicit close method; they close when the db
> object is freed. In Python, it's trivially easy to get a reference to an
> object and quite difficult to ensure that no one at all is referencing
> an object. In particular, if an exception is thrown from code dealing
> with a db object, the db object can become captured in the corresponding
> traceback object. That traceback object may never be freed, thus keeping
> the db object alive forever. This is not a theoretical problem: we've
> run into this often enough to add it to our test set.
> 
> In the process of debugging that problem, I discovered that Python
> postpones deallocation of objects that should be freed as a result of
> exception handling until after exception handlers have completed. This
> illustrates that it can be quite difficult to predict when an object in
> Python will ever be deallocated. Moreover, changes in an unrelated body
> of code can create a reference cycle involving a xapian index object.
> Such a cycle prevents the reference counting machinery from freeing the
> db, leaving it to the garbage collector, which may not run for quite
> some time.
> 
> The fact that C/C++ objects can hold a reference to the db complicates
> matters further. For example, if an enquire object or query parser is
> left lying around, then the db object either of them references will
> never be freed.
> 
> Our code goes to *great* lengths to ensure that no one can ever capture
> a reference to the db and that all references are properly released in
> the event of any error. This code was difficult to write and feels
> fragile; I don't feel confident about the prospect of mixing threads
> with such delicate code. If we could explicitly call close on a db in
> such a way that guaranteed that the db object would close, life would be
> much simpler and I could sleep better at night.

It would probably be fairly simple to add a WritableDatabase::close()
method.  We already have a suitable "dummy" state which you get from
the default constructor:

    Xapian::WritableDatabase wdb;

Hmm, except the WritableDatabase class is really just a reference
counted pointer.  So each WritableDatabase which refers to the same
internal object has a separate copy of the pointer to that internal
object.  Which means we can't just use the dummy state we already have -
we need to handle this in the internal object.

So it's still possible, but a little more involved.
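Meanwhile, a Python-side pattern helps with the traceback-capture problem:
drop the local reference in a finally block, so freeing the object doesn't
depend on when the traceback and frame objects die.  A runnable sketch with
a stand-in class (relying on CPython's reference counting):

```python
import weakref

class Database:
    """Stand-in for a Xapian database object; the real one closes
    when the last reference to it goes away."""

def use_db():
    db = Database()
    ref = weakref.ref(db)
    try:
        raise ValueError("simulated indexing error")
    except ValueError:
        pass
    finally:
        # Drop the local reference explicitly, so the database is freed
        # even if an exception captured this frame in a traceback.
        del db
    return ref
```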

> 3. Memory consumption seems excessive. If we index lots of data, memory
> consumption grows and grows and does not shrink even after we call
> flush(). Do you think we could reduce memory consumption by calling
> reopen periodically when indexing many documents? Right now, we flush
> the db after indexing about 200 KB of text and that has bounded memory
> consumption for us but we're growing concerned that flushing so often
> may limit write performance by triggering unneeded disk syncs.

The default value of XAPIAN_FLUSH_THRESHOLD is 10000 documents, which
assumes you want to index fairly fast and don't mind using quite a lot
of memory.  All the changes to the posting lists are buffered in RAM
(changes to the other tables are made on disk).

The changes are stored in STL map structures, and by default GCC's STL
implementation holds onto freed memory for reuse within the process
rather than returning it to the OS - so you could run out of virtual
memory this way.

It's on my todo list to try to buffer the postlist changes in a more
compact way.  I've also wondered about trying to use blocks allocated
outside the heap (e.g. with anonymous mmap) so we can return them to
the OS easily.  I think this just needs a suitable allocator to be
used with the appropriate STL maps.

> 4. Xapian file locking does not mesh well with our application. Because
> locks have no information about the process (or host) which last held
> the lock, we have no way of knowing whether the index is still locked by
> another thread/process on the same machine or whether the lock file is
> leftover from a severely botched shutdown. This problem could be fixed by
> adding a process ID and hostname nonce to the lockfile and documenting
> it so that upon finding a lock file, we could check for the process that
> created it and act appropriately.

The plan is to move to fcntl() locking (and the roughly equivalent file
locking calls on Windows).  But the obstacle to overcome is that fcntl
locks are owned by a process, so we lose locking between different
threads in a process - if one thread opens a Xapian database for
writing, another process won't be able to, but another thread in the
same process will succeed.  To address this we seem to need to somehow
also have a mutex for each database - a little tricky as we don't really
want to just key off the filename as symlinks mean you can open the same
database using different paths.  Device major + device minor + inode
probably does the trick.  And ideally we also need some way to allow
optional use of pthreads from Xapian without requiring all Xapian-using
applications to pull in pthreads.
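The per-database mutex registry might look like this sketch (in Python for
brevity; the real implementation would be C++).  Keying on device and inode
means different paths to the same file share one mutex:

```python
import os
import threading

# One in-process mutex per underlying database, keyed by (device, inode)
# so that symlinked or relative paths to the same file share a mutex.
_locks = {}
_registry_mutex = threading.Lock()

def lock_for(path):
    st = os.stat(path)
    key = (st.st_dev, st.st_ino)
    with _registry_mutex:
        if key not in _locks:
            _locks[key] = threading.Lock()
        return _locks[key]
```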

I had a poke around the net a while ago, but didn't really find an
answer to how other applications address this problem.

Perhaps we just say that inter-thread locking doesn't work on UNIX until
POSIX specifies a thread aware file locking API.  That seems lame though.

> For that matter, we have no idea under what conditions it is safe to
> delete a lock file. This really should be documented somewhere; we
> absolutely cannot ship code that fails to run after a machine crashes
> because of a stale lock file. Right now, we delete lock files when needed,
> but information on when it is safe to do so and the consequences of
> doing so would be most welcome.

Essentially you can safely delete it if-and-only-if the process which
took the lock is no longer running (which must mean it was terminated
abnormally).  I.e. the cases where an fcntl lock would have been
released automatically.  There are a few other cases where you'll
probably get away with it with great care, but I wouldn't recommend
relying on them.
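If the lockfile did record a pid as you suggest, the liveness check on POSIX
would look something like this (a hostname check would also be needed for
databases on network filesystems, since pids are only unique per host):

```python
import os

def pid_is_running(pid):
    # Signal 0 checks for process existence without actually signalling.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False      # no such process: the lock is stale
    except PermissionError:
        return True       # exists, but owned by another user
    return True
```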

> 5. There is no way to sort search results by value in reverse order. We
> need this functionality and hack around it by adding a second "reverse"
> value for every value we want to sort on.

See: http://xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=57
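Until that's implemented, the "reverse" value workaround doesn't need a
second human-readable copy - a byte-complemented copy sorts in the opposite
order.  A sketch (this assumes values of equal length; pad variable-length
values first, or prefixes of longer values will sort incorrectly):

```python
def reverse_sort_key(value):
    # Ascending order over the complemented bytes is descending order
    # over the original value.
    return bytes(0xFF - b for b in value)
```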

> 6. The behavior of the query parser is highly...underdocumented.

Yeah.  The API needs cleaning up too, and it needs fixing to be
reentrant.  The lack of reentrancy is a limitation of using Bison, so
it needs rewriting to use something else.  Lemon is my current favoured
option, and I've even made a start.

> We expended considerable effort early on discovering that the query
> parser employs an english language stemmer by default. Since we do no
> stemming, this caused problems for us. We only learned how to disable
> query parser stemming by trial and error.

For the record, either of:

QueryParser::set_stemming_options("");
QueryParser::set_stemming_options("none");

> I still do not understand set_prefix

If you want the QueryParser to understand "author:" and terms from the
author field have an "A" prefix, then use: set_prefix("author", "A")

> and why searching for
> INDEXERVERSION:1 does not work when using the query parser while a raw
> query of INDEXERVERSION1 works as expected. It seems that the query
> parser does not like numbers following prefixes, but I am at a loss to
> explain why that is.

I think it's a bug.  Or at least QueryParser uses a rather delicate rule
for when to add a ":" between the prefix and the term, which scriptindex
doesn't implement.  The rule is undocumented (except in the code) so
it's arguable who is correct.

> 7. A lot of the documentation and mailing list discussions involve
> people using omega rather than xapian as a library. That's fine, but I
> wish there was better documentation regarding how to build something
> like omega starting with xapian. For example, here are some questions
> that I've had trouble getting answers to without reading source code:

I'll answer them here, and slot the answers into the documentation when
I have time.

> How do I map my own UIDs to xapian documents?

If they're positive numbers (strictly positive - 0 isn't valid) and not
sparse, you can just map them to xapian document ids.  Just use
replace_document() instead of add_document() so you can specify the ID.

Otherwise, the standard approach is to add a Q<UID> term to each
document.  There are now variants of replace_document() and
delete_document() which take a uid term instead of a document id.
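A sketch of that convention (the stand-in class just mimics the
replace/delete-by-term semantics; with the real bindings you'd call
db.replace_document(uid_term(uid), doc) on a xapian.WritableDatabase):

```python
def uid_term(uid):
    # The conventional unique-ID term: "Q" prefix plus your external UID.
    return "Q" + uid

class FakeUidDatabase:
    """Stand-in mimicking replace/delete-by-unique-term semantics."""
    def __init__(self):
        self.docs = {}

    def replace_document(self, term, doc):
        self.docs[term] = doc   # replaces any document carrying this term

    def delete_document(self, term):
        self.docs.pop(term, None)
```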

> What is the key length limit?
> How does that limit interact with embedded nulls in the key?

See http://article.gmane.org/gmane.comp.search.xapian.devel/183

The short answer is that you can't safely have a term longer than 245
bytes.  And each zero byte counts double, so the limit is 122 bytes
if they're all zero bytes.

Son-of-quartz will fix the zero byte stupidity, and probably push the
limit up nearer 255.
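A pre-flight check for the current limits is easy to write (245 and the
zero-byte doubling are as described above; treat the constant as
quartz-specific, since the safe length depends on the backend):

```python
MAX_SAFE_TERM_BYTES = 245  # quartz-specific, per the limits above

def term_fits(term):
    # Zero bytes count double, so the effective length is the byte
    # length plus the number of zero bytes.
    return len(term) + term.count(0) <= MAX_SAFE_TERM_BYTES
```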

> How do I fake field searching (i.e., subject, sender, etc)?

Index terms from fields with a prefix - e.g. XSUB for subject, XSEND for
sender.  Then set_prefix("subject", "XSUB"), etc to tell the query
parser how to map them.  Note you can create aliases (e.g.
set_prefix("title", "XSUB") so both "title:" and "subject:" work)
and also you can translate the prefixes the user sees into other
languages - e.g. set_prefix("sujet", "XSUB") for a French frontend.
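The mapping the parser performs can be shown as a self-contained sketch
(this mimics the prefixing convention, including aliases; it is not the
QueryParser's actual code):

```python
PREFIXES = {
    "subject": "XSUB",
    "title": "XSUB",    # alias: both title: and subject: work
    "sujet": "XSUB",    # translated alias for a French frontend
    "sender": "XSEND",
}

def map_field_term(word):
    # "subject:meeting" -> "XSUBmeeting"; unprefixed words pass through.
    field, sep, rest = word.partition(":")
    if sep and field in PREFIXES:
        return PREFIXES[field] + rest
    return word
```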

> How do I weight some fields higher than others?

Pass a higher value for the wdfinc parameter to Document::add_posting().
This optional third parameter defaults to 1.  Document::add_term() also
has an optional wdfinc parameter.
 
> Do I need to call set_database on the query parser?

Currently never, I'm slightly surprised to find.  It may help the query
parser in the future though - there's commented out code for an
automatic keyword thing which Ananova were interested in.  They were
paranoid that people wouldn't be able to search for the latest stupidly
named pop sensation in their news archive, so the idea was if they
indexed a term Kb*witched then a search for B*witched would check if
such a term existed and if so use it.

But being able to check for term existence might be useful for improved
phrase handling.  Currently something like "e-mail" generates a phrase
search, and is often a slow case.

> When do I need to reopen a db?

When you open a db for reading, you open a snapshot version.  If two
updates are committed to that db, the snapshot will be overwritten and
you will get a DatabaseModifiedError exception.  At this point you need
to reopen() the db and retry the operation which failed.  Or perhaps
restart from earlier if a set of operations need to be done on the same
snapshot of the database for consistency.
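In code, the retry pattern looks something like this (with stand-ins for
xapian.DatabaseModifiedError and the database object so the sketch is
self-contained):

```python
class DatabaseModifiedError(Exception):
    """Stand-in for xapian.DatabaseModifiedError."""

class FakeDatabase:
    """Stand-in whose snapshot has been overtaken by two updates."""
    def __init__(self):
        self.stale = True
    def reopen(self):
        self.stale = False
    def run_query(self):
        if self.stale:
            raise DatabaseModifiedError()
        return "results"

def search_with_retry(db, attempts=3):
    # Retry the single failed operation after reopen(); restart from
    # earlier instead if several operations must share one snapshot.
    for _ in range(attempts):
        try:
            return db.run_query()
        except DatabaseModifiedError:
            db.reopen()
    raise DatabaseModifiedError("still stale after %d attempts" % attempts)
```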

Requiring users to handle this rare special case (or else ensure 2
updates can't happen during a search) is rather a pain, and the plan for
son-of-quartz is that readers will lock revisions they're using to
eliminate DatabaseModifiedError.

> We've come up with answers to all except the last three questions so
> obviously this isn't fatal, but it took a lot more experimentation and
> testing than we would have liked. I'm thinking of putting together a
> short web page explaining what we learned to save others the trouble
> later on.

Documentation patches are also most welcome!

I started setting up a wiki to facilitate this sort of thing, but it's
stalled due to me being too busy recently.

> 8. It would be nice if xapian exceptions were properly translated into
> python exceptions

I think this is waiting on rejigging exception handling in the C++
library.

> and if there was a documented list of ALL exceptions
> that xapian could raise and whether calling reopen and retrying the
> operation would make any difference.

http://www.xapian.org/docs/apidoc/html/errortypes_8h.html

Currently there's nothing fine grained as to which methods can throw
which exceptions.  It's a reasonable job to determine this, and I can't
see an easy way to ensure it's kept up to date.  Some methods are
documented to throw certain exceptions, but even those lists may be
incomplete.

The plan is to try to shift exception throwing from the depths of the
library to a thin wrapper layer, and also expose a non-exception
API.  Then it should be much clearer which methods can throw which
exceptions.

Only DatabaseModifiedError means you should call reopen().

> 9. It would be nice if we could refer to values by strings rather than
> integers. Our application includes our own object database with indexes;
> we convert index keys in our object database into xapian values. For
> example, an index in our object database is "atop.store.DateIndex" which
> has values that look like "2004-04-02". It would be handy if xapian
> would manage the mapping from "atop.store.DateIndex" to value number 33.
> Right now we manage that mapping. This is a very low priority issue, but
> I suspect that lots of people would prefer sorting by strings rather
> than integers. We don't need full blown support for string value naming;
> I'd happily settle for some way of keeping a string to integer map in
> the xapian db.

The original idea was that values should be fast to access, so we'd
number them rather than naming them.  With the way values are currently
stored that doesn't actually make so much sense, but this might change.

And it does stop people using long names and then complaining they're
slow.  You wouldn't want to use "atop.store.DateIndex" in every key of
the Btree value table.

But as you suggest, we could provide a way of managing mappings from
names to numbers.
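In the meantime, an application-level map of the kind you suggest is only a
few lines; to survive restarts the dict would need persisting somewhere,
e.g. in a special document in the database itself:

```python
class ValueSlots:
    """Name-to-value-number map: allocate a slot number on first use,
    return the same number for the same name thereafter."""
    def __init__(self):
        self._slots = {}

    def slot(self, name):
        if name not in self._slots:
            self._slots[name] = len(self._slots)
        return self._slots[name]
```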

> 10. Is there a way to access btree:max_key_len from python? Since our
> terms may contain embedded nulls, we need to know this limit to avoid
> adding a document with a term that exceeds this limit.

There isn't, but max_key_len isn't a number you're interested in anyway.
See the gmane URL above for details.

We could add a WritableDatabase method to return the max safe term length
for a particular database (the value depends on the backend).  But we'd
need to be conservative and return 122 for Quartz.  Or provide a "no
zero byte" version as well.

> A very minor but
> related issue is that when you add a document with a key exceeding the
> limit, there is no way to find out which key caused problems.

Perhaps the key (or better still the term name) should be included in
the error message.

> Thanks for reading this far! By the way, I'm the primary guy at divmod
> working on xapian. When I say "we" above, I usually mean "I". Despite
> the issues, xapian has been a joy to work with and we're very happy with
> it.

Cool.

Cheers,
    Olly


