<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">
<blockquote type="cite">
<pre wrap="">Is there a way to do adaptive query scoring (as in popular results
returned by a query should get more weight because they are getting
clicked more often) in xapian? Is this what the rset class should be
used for?
</pre>
</blockquote>
<pre wrap=""><!---->
You could use the RSet to achieve something like this by recording
which documents users like for which queries and setting an RSet from
that when there's a query for the same terms. It would probably
make sense to use a second Xapian database to store the queries matching
each document click so you'd run a search on that to find what to set
the RSet as on the main database.
</pre>
</blockquote>
Which approach do you think would be easier and, more importantly,
have the least overhead? It seems to me that adding adaptive terms (or
whatever a good name for these would be!) and just rewriting the
queries, working on a single Xapian db, would mean less overhead (and
less maintenance). What do you think? Would the RSet approach be as
versatile, i.e. could it use the adjacent-word approach you suggest
below?<br>
<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">
<pre wrap="">
</pre>
<blockquote type="cite">
<pre wrap="">I could write a php app to do adaptive results scoring for separate
words (just recording the clicks and then have a cron:ned script add
weight to the document_id:s for the recorded words)
</pre>
</blockquote>
<pre wrap=""><!---->
That would be another way - you could add a prefixed term (e.g.
XCLICKfoo) to those documents which the user selected when they
had searched for "foo". Then turn a search term "foo" into
(foo ANDMAYBE XCLICKfoo) (must match foo, if XCLICKfoo also matches
add the weight from that.)
</pre>
</blockquote>
Yep, this sounds workable.<br>
Does the ANDMAYBE operator add much overhead to queries? Would it be
faster to just use the OR operator? After all, if a result matches the
XCLICK* term, it _must_ also match the original term, so OR should
return the same set of documents.<br>
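To make the rewrite concrete, here is a minimal Python sketch of turning
each search word into its (word ANDMAYBE XCLICKword) form. It only builds
a plain description of the query; the function names, the lowercasing and
the tuple layout are assumptions for illustration, and mapping the result
onto real Xapian query objects is left out.<br>
<pre wrap="">
```python
CLICK_PREFIX = "XCLICK"  # prefix suggested in the quoted reply


def boosted_term(word):
    """Describe (word ANDMAYBE XCLICKword) as a plain tuple.

    The word itself must match; the click term only adds weight
    when it also matches. The tuple layout is hypothetical.
    """
    return ("AND_MAYBE", word, CLICK_PREFIX + word)


def rewrite_query(query):
    """Rewrite every whitespace-separated word of the query.

    Lowercasing is an assumption about how terms were indexed.
    """
    return [boosted_term(word) for word in query.lower().split()]


print(rewrite_query("foo"))
```
</pre>
How the per-word pieces get combined (AND, OR, ...) is a separate choice
that this sketch does not make.<br>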
<br>
<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">
<pre wrap="">I'm not totally sure that matters - for the example you give, there's
going to be a very strong correlation.</pre>
</blockquote>
Not a very good example, agreed :)<br>
<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">
<pre wrap=""> There certainly are words which
have many meanings where there's less correlation (e.g. 'stock market'
vs 'vegetable stock') and even word order can make a big difference
(e.g. 'oil bath' vs 'bath oil'). But for the 'stock' example, a query
for just 'stock' could usefully promote results from both, and a query
for 'stock market' would have 'market' in too, so although the cookery
pages would get a boost, the financial pages would get a larger one.
</pre>
</blockquote>
Yeah, there must be tons of word pairs out there that would benefit from
some sort of 'mutual' scheme, but then there are probably a great many
that would suffer from it too. Especially in our data set.<br>
<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">
<pre wrap="">
In fact, I suspect you would improve retrieval overall simply by
favouring pages which somebody has clicked on for some query (especially
for a search over random web sites - the web is full of useless junk
which nobody will ever want in their results). That approach is
particularly susceptible to "clickbot" abuse though.
</pre>
</blockquote>
I have a pretty special set of data to search. I am building a
search app for a large shopping portal, and the data I search through
comes from merchants' product feeds. Since our users are logged in most
of the time when they use our site, I can mitigate clickbotting
quite well by only letting each user cast one 'vote' per word/phrase
per day, or maybe per word/phrase ever. That sacrifices some 'input'
from non-logged-in users, but at least makes clickbotting difficult. I
might do it based on IP for non-logged-in users. That sacrifices some
NAT users, but you can't win them all :p<br>
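The one-vote-per-day rule could be sketched roughly like this in Python.
It is only an in-memory illustration: the VoteLog class, the voter id
strings ("user:42", or something like "ip:..." for non-logged-in users)
and the lowercasing of phrases are all hypothetical choices, and a real
version would keep the log in a database.<br>
<pre wrap="">
```python
import datetime


class VoteLog:
    """Remember which (voter, phrase, day) combinations already voted."""

    def __init__(self):
        self._seen = set()

    def record(self, voter_id, phrase, day):
        """Return True if this click counts, False if it is a repeat."""
        key = (voter_id, phrase.lower(), day)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


log = VoteLog()
today = datetime.date(2006, 5, 16)
first = log.record("user:42", "xbox console", today)   # counts as a vote
repeat = log.record("user:42", "xbox console", today)  # same day: ignored
print(first, repeat)
```
</pre>
Dropping the day from the key would give the stricter "one vote per
word/phrase ever" variant.<br>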
<br>
I can see that your theory here of favoring all results that get
clicked regardless of the query can work for a regular web search, but
I don't think it will pan out as well for us, since our results are
specific products.<br>
<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">
<pre wrap="">
But anyway, if you want to work with phrases, the hard part is to decide
what's a phrase. Then just generate a term for the phrase e.g.
"XCLICKxbox console". If you're going to treat the whole query as a
phrase, I'd suggest you try generating terms from adjacent word pairs
(so 'natural history museum' gives "XCLICKnatural history" and
"XCLICKhistory museum").
</pre>
</blockquote>
Sounds like a pretty good idea to me. It gives a 'mutual' effect, but
one that is limited enough to hopefully keep the adaptive terms accurate
for the product itself. I shall give it a good think and maybe write an
experimental implementation here later today.<br>
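As a first rough sketch of the adjacent-word-pair idea, in Python for
brevity: the function name click_terms, the lowercasing and the fallback
for single-word queries are assumptions of mine, not something from the
quoted suggestion.<br>
<pre wrap="">
```python
def click_terms(query, prefix="XCLICK"):
    """Build prefixed click terms from adjacent word pairs of a query."""
    words = query.lower().split()
    if len(words) == 1:
        # Assumed fallback: a one-word query yields a single-word term.
        return [prefix + words[0]]
    # Pair each word with its right-hand neighbour.
    return [prefix + a + " " + b for a, b in zip(words, words[1:])]


print(click_terms("natural history museum"))
# prints ['XCLICKnatural history', 'XCLICKhistory museum']
```
</pre>
The same function would be used both when recording a click (to decide
which terms to add to the document) and when rewriting a query.<br>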
<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">
<pre wrap="">
I'd love to hear how you get on.
</pre>
</blockquote>
Absolutely, I'd love to get feedback from the creator of the actual
search engine :)<br>
I'd be happy to contribute code back to the Xapian project if you think
there is any use for it. I can only offer PHP code, but for example I
have two classes, one for indexing and one for searching, which may be
suitable as PHP examples for other PHP users to look at. They use many
more of the Xapian features than the present examples do.<br>
<br>
Regards<br>
Alec<br>
</body>
</html>