[Xapian-discuss] Term prefixes (was: Xapian Feedback)

Olly Betts olly at survex.com
Fri Jan 14 17:23:40 GMT 2005


I wrote:
> I think it's a bug.  Or at least QueryParser uses a rather delicate rule
> for when to add a ":" between the prefix and the term, which scriptindex
> doesn't implement.  The rule is undocumented (except in the code) so
> it's arguable who is correct.

I've been looking at this some more.

We need some way to distinguish the term prefix from the term itself.

The scheme Omega uses is that a single upper case letter is a term prefix,
unless it's an X.  An X signals a longer term prefix.

So the question really is when such a prefix ends:

* omindex doesn't create multi-character prefixes, so has no real view on
  the matter.  It does create single character prefixes, and just appends
  the value to them.

* scriptindex takes an optional "prefix" for boolean and text indexing.
  When text indexing, this prefix is simply prepended to the term, except
  that raw terms (which already get an R prefix) get a ":" inserted
  between the specified prefix and the R prefix if the specified prefix
  is 2 or more characters long and doesn't end in a ":" already.  None of
  the examples for scriptindex include an explicit ":".

  For boolean terms, the prefix is simply prepended to the term.

  So "Olly" with prefix "XABC" is indexed as "XABColli" and "XABC:Rolly".
  The rationale for this is that otherwise prefixes "XABC" and "XABCR"
  get confused.  Note that scriptindex doesn't enforce the "X" rule for
  multi-character prefixes.

* omega only looks at prefixes directly when handling "B" parameters.
  Terms with the same prefix are OR-ed, then groups with different
  prefixes are AND-ed.  E.g.

  (Ttext/plain OR Ttext/html) AND Hwww.xapian.org
  
  For this, it assumes a single character prefix unless it starts with
  X, in which case it takes the longest all uppercase prefix.  If
  there's a ':' after this, it is ignored (not part of the prefix
  or the value).

* Xapian::QueryParser uses this code:

    if (prefix.length() > 1) {
        unsigned char back = prefix[prefix.length() - 1];
        if (back != ':') {
            if (!C_isupper(back) || C_isupdig(term[0])) {
                prefix += ':';
            }
        }
    }

  which isn't an especially good match for what the Omega indexers do.

  If the prefix is a single character, or already has a ":", this doesn't
  do anything.

  For a multi-character prefix, this adds a ":" if the last character of the
  prefix isn't upper case (which is peculiar but harmless given the rules
  everything else uses).  It also adds a ":" if the term starts with an
  uppercase letter (good) or a digit (bad).
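
To make the indexer-side rules above concrete, here is a minimal sketch of
how I read them.  The function names are made up for illustration and this
is not the actual scriptindex or omega source.  In particular, skipping a ':'
after a single character prefix is my assumption, since the description above
doesn't say whether that case is handled.

```cpp
#include <cctype>
#include <string>
#include <utility>

// Hypothetical: scriptindex text indexing simply prepends the prefix.
std::string scriptindex_text_term(const std::string& prefix,
                                  const std::string& stemmed) {
    return prefix + stemmed;
}

// Hypothetical: scriptindex raw terms get an "R" prefix, with a ":" inserted
// first if the user's prefix is 2 or more characters long and doesn't already
// end in ":".
std::string scriptindex_raw_term(std::string prefix, const std::string& word) {
    if (prefix.size() >= 2 && prefix[prefix.size() - 1] != ':')
        prefix += ':';
    return prefix + 'R' + word;
}

// Hypothetical: split a boolean term into (prefix, value) as described for
// omega's "B" parameter handling.  A single character prefix is assumed
// unless the term starts with 'X', in which case the longest all uppercase
// run is taken.  A ':' straight after the prefix is dropped.
std::pair<std::string, std::string>
omega_split_boolean_term(const std::string& term) {
    if (term.empty())
        return std::make_pair(std::string(), std::string());
    std::string::size_type end = 1;
    if (term[0] == 'X') {
        while (end < term.size() &&
               std::isupper(static_cast<unsigned char>(term[end])))
            ++end;
    }
    std::string::size_type value_start = end;
    if (value_start < term.size() && term[value_start] == ':')
        ++value_start;  // Assumption: also applied after a 1 character prefix.
    return std::make_pair(term.substr(0, end), term.substr(value_start));
}
```

With these, scriptindex_raw_term("XABC", "olly") gives "XABC:Rolly", and
omega_split_boolean_term("XABC:Rolly") recovers the prefix "XABC".  Without
the ":", "XABCRolly" splits as prefix "XABCR" and value "olly", which is
exactly the confusion the ":" is there to avoid.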

One issue is that what Omega chooses to do with prefixes is currently just
one way of using the library.  Except for Xapian::QueryParser, the xapian-core
library really just treats terms as strings of bytes.  Longer term, perhaps we
should look at supporting prefixes (and fielded searching in general) more
explicitly, even if it's by pushing a system similar to the above down into the
library where we can hide the oddities behind the scenes better.  In
particular, it would be nice to split "document length" per field so you can
search just document titles or abstracts from papers with the appropriate
length corrections in the weights.

But in the short term, we want everything to be working in step.  I think
the simplest fix (which in particular avoids requiring database rebuilds)
is to change Xapian::QueryParser to check C_isupper(term[0]) instead of
C_isupdig(term[0]), and to drop the C_isupper(back) test.
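
As a rough sketch of what that change would look like, here are the current
and proposed rules side by side.  This is illustrative only, not a patch:
the function names are invented, and is_upper() and is_digit() are local
stand-ins for Xapian's internal C_isupper() and C_isupdig() helpers, which
I'm assuming here are plain ASCII tests.

```cpp
#include <string>

// Local stand-ins (assumed ASCII-only) for Xapian's internal helpers.
inline bool is_upper(unsigned char ch) { return ch >= 'A' && ch <= 'Z'; }
inline bool is_digit(unsigned char ch) { return ch >= '0' && ch <= '9'; }

// Current rule: for a multi-character prefix not ending in ':', add a ':'
// if the prefix doesn't end in an upper case letter, or if the term starts
// with an upper case letter or a digit.
std::string current_prefixed_term(std::string prefix, const std::string& term) {
    if (prefix.length() > 1) {
        unsigned char back = prefix[prefix.length() - 1];
        if (back != ':') {
            if (!is_upper(back) || is_upper(term[0]) || is_digit(term[0])) {
                prefix += ':';
            }
        }
    }
    return prefix + term;
}

// Proposed rule: for a multi-character prefix not ending in ':', add a ':'
// only if the term starts with an upper case letter.
std::string proposed_prefixed_term(std::string prefix, const std::string& term) {
    if (prefix.length() > 1) {
        unsigned char back = prefix[prefix.length() - 1];
        if (back != ':' && is_upper(term[0])) {
            prefix += ':';
        }
    }
    return prefix + term;
}
```

Both versions turn prefix "XABC" and term "Rolly" into "XABC:Rolly", and
"XABC" and "olli" into "XABColli", matching what scriptindex indexes.  They
differ for a term like "2005": the current rule gives "XABC:2005", while the
proposed rule gives "XABC2005", which is what scriptindex produces by simply
prepending the prefix.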

If you've specified an explicit ":" to scriptindex, you'll also need to specify
it to Xapian::QueryParser.  Currently you'll mostly get away with leaving it
out, but having to specify it is fair enough really.

I think scriptindex should perhaps also warn if you specify a multi-character
prefix which doesn't start with an "X", since Omega and Xapian::QueryParser
won't necessarily handle it as you'd hope.

Cheers,
    Olly


