[Xapian-discuss] Another query parser bug

Tue Oct 23 17:13:29 BST 2007

On Tue, Oct 23, 2007 at 04:35:04PM +0200, Ron Kass wrote:
>    print "wrong: " . $QueryParser->parse_query(qq{Title:word -notallowed},
>        (FLAG_BOOLEAN | FLAG_PHRASE | FLAG_LOVEHATE | FLAG_WILDCARD)) . "\n";

I think this is related to a problem I noticed earlier this week - we
fail to parse filter-type operations in the middle of a query:

     foo site:example.org bar
     foo -site:example.org bar
     foo -ignore bar

People tend to specify filters at the end, which I guess is why
nobody noticed this before.

I looked into those cases and it's down to the grammar rules not
allowing it, which is a bug, but a bit more involved to fix than
your previous one.  I'll add your testcases to mine and check they
all work when I fix this.

> And one last question regarding the parser in this case..
> Should/Could there be any performance difference between the following
> three parsed queries? (FILTER vs AND_NOT and AND_NOT*2 vs AND_NOT/OR)
> 1. Xapian::Query(((Zterm:(pos=1) Znotallow:(pos=2)) FILTER (Tfirst OR
> Tword)))

There seems to be an operator (AND_NOT?) missing before Znotallow.

> 2. Xapian::Query(((Zterm:(pos=1) AND_NOT Znotallow:(pos=2) AND_NOT
> Tfirst:(pos=3)) FILTER Tword))
> 3. Xapian::Query(((Zterm:(pos=1) AND_NOT (Znotallow:(pos=2) OR
> Tfirst:(pos=3))) FILTER Tword))

I can see that (2) and (3) are essentially the same query represented in
two different ways.  But (1) seems to be a different query (no matter
what the missing operator is).  If that's correct, then (1) clearly can
(and often will) perform differently to (2) and (3).
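
To illustrate why (2) and (3) match the same documents, here is a rough
sketch in plain Python. It does not use Xapian at all. It assumes each term
is just a set of document ids (the ids below are made up), treats AND_NOT as
set difference and OR as set union, and models FILTER as a plain
intersection, which ignores the fact that FILTER only takes weights from its
left-hand side. So it says something about which documents match, and
nothing about ranking or speed.

```python
# Toy model: each term maps to the set of document ids containing it.
# These postings are invented purely for illustration.
postings = {
    "Zterm": {1, 2, 3, 4, 5, 6},
    "Znotallow": {2, 5},
    "Tfirst": {3, 5, 7},
    "Tword": {1, 2, 3, 4, 7},
}


def query2(p):
    # ((Zterm AND_NOT Znotallow AND_NOT Tfirst) FILTER Tword)
    return ((p["Zterm"] - p["Znotallow"]) - p["Tfirst"]) & p["Tword"]


def query3(p):
    # ((Zterm AND_NOT (Znotallow OR Tfirst)) FILTER Tword)
    return (p["Zterm"] - (p["Znotallow"] | p["Tfirst"])) & p["Tword"]


if __name__ == "__main__":
    print(sorted(query2(postings)))
    print(sorted(query3(postings)))
```

Both functions return the same set for any input, since removing B and then
C from A is the same as removing everything in (B OR C) from A.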

Currently, (2) and (3) will actually be executed in different ways.  I'm
not certain which would be more efficient (and it may depend on the
data).  I suspect there's not much in it unless there are a lot of
filter terms, in which case my hunch is that (3) might have the edge
because of the balancing we do for OrPostList trees.  If you have, or
can easily produce, some benchmark data, it would be interesting to
know.

I've implemented an internal "QueryOptimiser" class for 1.0.4 which
provides a much improved framework for building optimal postlist trees
from queries, so it's now much easier to do this sort of thing.

Cheers,
    Olly