[Xapian-discuss] add_prefix() versus add_boolean_prefix()
Daniel Ménard
Daniel.Menard at ehesp.fr
Tue Nov 18 16:37:54 GMT 2008
Thanks a lot, Olly, for answering my previous question, and forgive me
for replying so late: I didn't get a chance to answer sooner...
I'm not completely sure that I fully understood your explanations, but
now I'm pretty sure that I have a wrong idea of what boolean
prefixes are. I saw that Torsten Foertsch had a quite similar question
(http://thread.gmane.org/gmane.comp.search.xapian.general/6711) and have
read the answers Jim Lynch and you gave him. I also searched the site
and the list archives for a definition but only found
examples of what they can be used for and details about how to use them
(API), which is good, but I'm still not completely clear about what
boolean prefixes can and can't do...
I will try to explain what I intended to do so, hopefully, someone will
see where I'm wrong (sorry, it's a bit long...)
Our use case is not the general one: we're using Xapian to search
bibliographical records containing things like titles, abstracts, notes
and other fields which can be used to restrict a search: typical ones
(type of document, year of publication, language, country...) but also
fields containing "controlled vocabulary" (keywords, publisher,
collection, organization, periodical...)
What we want to do is to define "views" on these records. In our mind, a
view is a set of rules which restrict the corpus of records against
which the query will be run. Some of these rules are "hardcoded" in
our application (they are combined with the user query by using an
OP_FILTER operator), but some of these rules can also be defined by the
user (e.g. health promotion date:2008 publisher:"Editions Masson"). In
our mind, such a request really means: find records about health
promotion, but only those published in 2008 by Masson.
Assuming that the default operator used by the query parser is OR, the
above query will not give "good" results: from a user's point of view, any
document which is not from Editions Masson or was not published in 2008
is just "noise". Using AND as the default operator would help but is a
bit too strict: "health promotion" really is a free text query and a
document having only the term "promotion" is probably still a good
answer for our database (of course it will work fine if the user
manually adds the AND operators in the right places in her query).
I also suppose that these filters will affect the score obtained by the
free text part of the query if I use OR or AND (I'm not completely clear
about what it means with AND: the doc says that OP_AND sums the scores
from both branches, but how does that affect the final MSet and its order?).
So I thought that I had to use OP_FILTER (or perhaps OP_SCALE_WEIGHT
with a factor of 0? Is it the same?) and that boolean prefixes were
the right way to do that... I define "publisher", "date" and so on as
boolean prefixes and the query parser "magically" does what I want: it
extracts the filters from the user query and combines them in an OP_FILTER
clause which will have no impact on the ranking...
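
To make the scoring question concrete, here is a toy model in Python. It is
not Xapian code: each "posting list" is just a dict from a document id to a
made-up weight, and the helper names (op_or, op_and, op_filter, scale,
ranked) are hypothetical. It only mirrors the documented behaviour that
OP_OR and OP_AND add the weights of both branches while OP_FILTER keeps the
weights of the left branch only.

```python
# Toy model only: weights are invented numbers, not real Xapian scores.
from typing import Dict, List

Scores = Dict[str, float]  # document id -> weight, for matching documents


def op_or(a: Scores, b: Scores) -> Scores:
    # Matches documents in either branch; weights from both are summed.
    return {d: a.get(d, 0.0) + b.get(d, 0.0) for d in a.keys() | b.keys()}


def op_and(a: Scores, b: Scores) -> Scores:
    # Matches documents in both branches; weights from both are summed.
    return {d: a[d] + b[d] for d in a.keys() & b.keys()}


def op_filter(a: Scores, b: Scores) -> Scores:
    # Matches documents in both branches; only the left weights are kept.
    return {d: a[d] for d in a.keys() & b.keys()}


def scale(factor: float, a: Scores) -> Scores:
    # Multiplies every weight by a constant (0 makes the branch unweighted).
    return {d: factor * w for d, w in a.items()}


def ranked(scores: Scores) -> List[str]:
    # Highest weight first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


free_text = {"doc1": 2.2, "doc2": 1.5, "doc3": 0.4}   # "health promotion"
date_2008 = {"doc2": 0.5, "doc3": 3.0, "doc4": 3.0}   # date:2008

print(ranked(op_or(free_text, date_2008)))      # ['doc3', 'doc4', 'doc1', 'doc2']
print(ranked(op_and(free_text, date_2008)))     # ['doc3', 'doc2']
print(ranked(op_filter(free_text, date_2008)))  # ['doc2', 'doc3']
```

In this model, OR lets doc4 (which matches only the date) outrank doc1, AND
removes the noise but lets the weight of the date term reorder the results,
and FILTER keeps the order given by the free text alone. Combining the free
text with a branch scaled by 0 using AND gives the same weights as FILTER
here.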
It works fine if my filters are simple terms (e.g. date:2008) but not if
I use something more complex: phrases, brackets or even
wildcards... hence my previous mail, to which Olly replied.
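
As an illustration of the behaviour I was hoping for, here is a minimal
Python sketch that pulls filter clauses out of a user query before any
parsing happens. It is not how Xapian's QueryParser works: the function
split_query, the FILTER_FIELDS set and the regular expression are all
hypothetical, and it only handles the two simplest forms, field:value and
field:"quoted value" (no brackets, no wildcards).

```python
import re
from typing import List, Tuple

# Hypothetical list of fields that should act as filters, not free text.
FILTER_FIELDS = {"date", "publisher"}

# Matches field:value or field:"quoted value".
CLAUSE = re.compile(r'(\w+):(?:"([^"]*)"|(\S+))')


def split_query(user_query: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (free text, [(field, value), ...]) for known filter fields."""
    filters: List[Tuple[str, str]] = []

    def take(match: "re.Match[str]") -> str:
        field = match.group(1)
        if field not in FILTER_FIELDS:
            return match.group(0)  # unknown field: leave it in the free text
        quoted, bare = match.group(2), match.group(3)
        filters.append((field, quoted if quoted is not None else bare))
        return " "  # remove the clause from the free text

    # Collapse the whitespace left behind by removed clauses.
    free_text = " ".join(CLAUSE.sub(take, user_query).split())
    return free_text, filters


free, filters = split_query(
    'health promotion date:2008 publisher:"Editions Masson"'
)
print(free)     # health promotion
print(filters)  # [('date', '2008'), ('publisher', 'Editions Masson')]
```

Each (field, value) pair could then become one filter term, with the whole
set applied to the free-text query through OP_FILTER. How a value containing
a space should be turned into a term is left open in this sketch.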
I understood from Olly's answer that this behavior was indeed expected
(boolean prefixes are not intended to do what I'm trying to do) but I
fail to understand why.
I'm pretty sure I'm missing something which is obvious to others...
Perhaps I'm just lacking some theoretical background... (and English is
not my native language, which does not help!)
Thanks a lot for your patience,
Daniel
PS: below, some clarifications interleaved with Olly's replies.
Olly Betts wrote:
>> [test author:(john doe)]
>>
> It's a bad example to use "author:" here, since that would naturally
> be a free-text search, and it means that examples which look reasonable
> don't necessarily make much sense in the actual boolean prefix case.
>
I still don't get it... In my mind, the author clause is a filter:
either the document is written by this author or it is not, which
looks like a boolean clause to me.
And ideally, the scoring would only take "test" into account, ignoring
any weight contributed by this filter clause.
> [...] you can't apply a boolean prefix to a subexpression [...]
Is this a current limitation of the query parser, or is there a
fundamental reason why it isn't possible?
> In this case the subexpression isn't boolean, so as a better
> example, it's like this where "type:" is a boolean prefix:
>
> type:(html pdf)
>
> I'm not really sure that makes a lot of sense
I read it as a bracketed expression containing two terms which would be
combined using the QueryParser's default operator, giving a pure boolean
query like:
Xapian::Query(0 * (XTYPEhtml OR XTYPEpdf))
> I can see that there's a natural meaning for this case, which I don't
> think we currently handle:
>
> type:(html OR pdf)
>
I confirm: currently, the query parser gives me
pdf:(pos=2) FILTER type:(html
for this query.
>> A similar problem appears if I try a phrase search: [test author:"john
>> doe"] gives
>> Xapian::Query(((test:(pos=1) OR doe:(pos=2)) FILTER A"john))
>>
> I'm not really sure what you expect this to mean - a phrase isn't a
> boolean sub-expression, and I wouldn't expect boolean filter terms to
> have positional information.
>
As above, I don't get it... (I'm feeling really sorry...)
Using the API, I can create the following query:
Xapian::Query((test:(pos=1) FILTER (XAUTHORjohn:(pos=1) PHRASE 2
XAUTHORdoe:(pos=2))))
but I can't generate it using the query parser.
I'm sure there is a very good reason for the query parser to parse so
differently depending on whether a prefix is declared as boolean
or normal, but, once again, I'm missing it...
> Looking at a better example, what would you expect this to mean?
>
> type:"html pdf"
>
For me, it means a pure boolean query (only a filter clause) containing
a phrase search... something like this:
Xapian::Query(FILTER (XTYPEhtml:(pos=1) PHRASE 2 XTYPEpdf:(pos=2)))
or perhaps like this:
Xapian::Query(0 * (XTYPEhtml:(pos=1) PHRASE 2 XTYPEpdf:(pos=2)))
> Incidentally, http://trac.xapian.org/ticket/128 suggests it should be a
> single filter term with a space in, which seems a reasonable way to
> allow that to be specified. So in this case, the term would be:
>
> XTYPEhtml pdf
>
I'm not sure I understand how this relates to boolean prefixes...
Again, thank you for your patience,
--
Daniel Ménard