<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffff" text="#000000">

<br>

<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">

  <blockquote type="cite">

    <pre wrap="">Is there a way to do adaptive query scoring (as in popular results

returned by a query should get more weight because they are getting

clicked more often) in xapian?  Is this what the rset class should be

used for?

    </pre>

  </blockquote>

  <pre wrap=""><!---->

You could use the RSet to achieve something like this by recording

which documents users like for which queries and setting an RSet from

that when there's a query for the same terms.  It would probably

make sense to use a second Xapian database to store the queries matching

each document click so you'd run a search on that to find what to set

the RSet as on the main database.

  </pre>

</blockquote>

Which approach do you think would be easier and, more importantly,

give the least overhead?&nbsp; It seems to me that adding adaptive terms (or

whatever a good name for these would be!), then just rewriting the queries

and working on a single Xapian db, would mean less overhead (and less

maintenance). What do you think?&nbsp; Would the RSet approach be as

versatile, i.e. could it use the adjacent-word approach you

suggest below?<br>

<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">

  <pre wrap="">

  </pre>

  <blockquote type="cite">

    <pre wrap="">I could write a php app to do adaptive results scoring for separate

words (just recording the clicks and then have a cron:ned script add

weight to the document_id:s for the recorded words)

    </pre>

  </blockquote>

  <pre wrap=""><!---->

That would be another way - you could add a prefixed term (e.g.

XCLICKfoo) to those documents which the user selected when they

had searched for "foo".  Then turn a search term "foo" into

(foo ANDMAYBE XCLICKfoo) (must match foo, if XCLICKfoo also matches

add the weight from that.)

  </pre>

</blockquote>

Yep, this sounds workable.<br>

Does the ANDMAYBE operator add much overhead to queries?&nbsp; Would it be

faster to just use the OR operator?&nbsp; If a result matches the XCLICK*

term, it _must_ also match the original term.<br>
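To make sure I have understood the rewrite, here is a rough sketch of it in Python (I'd port it to PHP). It only builds a description of the query as a nested tuple and does not call Xapian at all; the function name, the tuple layout and the lowercasing are my own assumptions, only the XCLICK prefix comes from your mail:<br>
<pre>
```python
# Illustrative only: pure Python, no Xapian calls.
# The XCLICK prefix is from the thread; everything else is a hypothetical helper.
CLICK_PREFIX = "XCLICK"


def rewrite_term(word):
    """Turn a search word into (operator, required term, optional click term)."""
    term = word.lower()  # assumption: terms are indexed in lowercase
    # The plain term must match; the click term only adds weight when present.
    return ("AND_MAYBE", term, CLICK_PREFIX + term)


if __name__ == "__main__":
    print(rewrite_term("foo"))
```
</pre>
How the per-word pieces get combined for a multi-word query (AND or OR) is left open here on purpose.<br>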

<br>

<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">

  <pre wrap="">I'm not totally sure that matters - for the example you give, there's

going to be a very strong correlation.</pre>

</blockquote>

Not a very good example, agreed :)<br>

<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">

  <pre wrap="">  There certainly are words which

have many meanings where there's less correlation (e.g. 'stock market'

vs 'vegetable stock') and even word order can make a big difference

(e.g. 'oil bath' vs 'bath oil').  But for the 'stock' example, a query

for just 'stock' could usefully promote results from both, and a query

for 'stock market' would have 'market' in too, so although the cookery

pages would get a boost, the financial pages would get a larger one.

  </pre>

</blockquote>

Yeah, there must be tons of word pairs out there that would benefit from

some sort of 'mutual' scheme, but there are probably a great many

that would suffer from it too, especially in our data set.<br>

<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">

  <pre wrap="">

In fact, I suspect you would improve retrieval overall simply by

favouring pages which somebody has clicked on for some query (especially

for a search over random web sites - the web is full of useless junk

which nobody will ever want in their results).  That approach is

particularly susceptible to "clickbot" abuse though.

  </pre>

</blockquote>

I have a pretty special set of data to search on. I am building a

search app for a large shopping portal, and the data I search through

comes from merchants' product feeds. Since our users are logged in most

of the time when they use our site, I can mitigate clickbotting

quite well by only letting each user cast one 'vote' per word/phrase

per day, or maybe per word/phrase ever. That sacrifices some 'input' from

non-logged-in users, but at least makes clickbotting difficult. I might

do it based on IP for non-logged-in users. That penalises some NAT users,

but you can't win them all :p<br>
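<br>
Roughly what I have in mind for the vote limiting, as a Python sketch (the real thing would be PHP with a database table behind it). The class name, the in-memory set and the "user:" / "ip:" voter ids are all made up for illustration:<br>
<pre>
```python
import datetime


class VoteLog:
    """Hypothetical in-memory log allowing one vote per voter, phrase and day."""

    def __init__(self):
        self._seen = set()

    def record(self, voter_id, phrase, day):
        """Return True if the vote counts, False if it repeats an earlier one."""
        # voter_id is assumed to be "user:ID" when logged in, "ip:ADDRESS" otherwise.
        key = (voter_id, phrase.lower(), day)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


if __name__ == "__main__":
    log = VoteLog()
    today = datetime.date(2006, 5, 16)
    print(log.record("user:42", "xbox console", today))
    print(log.record("user:42", "xbox console", today))
```
</pre>
Dropping the day from the key would give the stricter "one vote per word/phrase ever" variant.<br>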

<br>

I can see that your theory here of favoring all results that get

clicked, regardless of the query, can work for a regular web search, but

I don't think it will pan out as well for us, since our results are

specific products.<br>

<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">

  <pre wrap="">

But anyway, if you want to work with phrases, the hard part is to decide

what's a phrase.  Then just generate a term for the phrase e.g.

"XCLICKxbox console".  If you're going to treat the whole query as a

phrase, I'd suggest you try generating terms from adjacent word pairs

(so 'natural history museum' gives "XCLICKnatural history" and

"XCLICKhistory museum").

  </pre>

</blockquote>

Sounds like a pretty good idea to me. Get a 'mutual' effect, but one

that is limited enough to hopefully keep the adaptive-terms accurate

for the product itself. I shall give it a good think and maybe write an

experimental implementation here later today.<br>
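A first rough cut of the pair generation, in Python for brevity (I'd write the real one in PHP). The function name is made up, and lowercasing the query and falling back to a single-word term for one-word queries are my own assumptions:<br>
<pre>
```python
def click_phrase_terms(query, prefix="XCLICK"):
    """Build click terms from adjacent word pairs of a query (illustrative only)."""
    words = query.lower().split()
    if len(words) == 1:
        # Assumption: a one-word query falls back to the plain XCLICK term.
        return [prefix + words[0]]
    # Pair each word with its right-hand neighbour.
    return [prefix + a + " " + b for a, b in zip(words, words[1:])]


if __name__ == "__main__":
    print(click_phrase_terms("natural history museum"))
```
</pre>
For 'natural history museum' this yields the two terms from your example, "XCLICKnatural history" and "XCLICKhistory museum".<br>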

<blockquote cite="mid20060516042322.GB3551@survex.com" type="cite">

  <pre wrap="">

I'd love to hear how you get on.

  </pre>

</blockquote>

Absolutely, I love getting feedback from the creator of the actual

search engine :)<br>

I'd be happy to contribute code back to the Xapian project if you think

there is any use for it. I can only offer PHP code, but for example I

have two classes, one for indexing and one for searching, which may be

suitable as PHP examples for other PHP users to look at. They use many

more of the Xapian features than the present examples do.<br>

<br>

Regards<br>

Alec<br>

</body>

</html>