GSOC 2016 project on Ranking

James Aylett james-xapian at tartarus.org
Mon Mar 7 14:52:30 GMT 2016


Arnav -- as suggested in our guidance for GSoC students, please keep
conversations on the mailing list so everyone can both help and
benefit :)

On Mon, Mar 07, 2016 at 01:04:07PM +0530, Arnav Jain wrote:

> I have read the documentation, installed the code, and am getting
> familiar with it. I saw the projects on Weighting Schemes and Learning to
> Rank. I would like to discuss both projects so that I can start
> making some changes in the code. I also went through the previous GSoC
> projects on Learning to Rank and read the blog post that gives a good
> introduction to the basics of learning to rank. I also went through the
> research work done on ranking systems. I have a few queries.
>
> 1) Xapian is using edit distance for spelling correction right now.
> Weighted edit distance could be used instead, which gives high weight to
> letters that are more likely to get misplaced. Also, bigrams can be used
> to find the correct words in a query using Bayes' theorem.

Okay -- that sounds like something you want to put into your proposal,
along with suitable references if there are any (particularly if
there's concrete work on using something in an IR context). It's a
good idea to indicate if there are upsides/downsides/other
considerations to either of those. For instance, would either be
significantly slower at search time than the current edit distance
approach? Would the bigram work want to have bigrams indexed directly?
(There is an unmerged GSoC project looking at how to do that, which
could form the basis for this if it makes sense.)
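
As a rough illustration of the weighted edit distance idea, here is a
minimal sketch in plain Python. It is not how Xapian computes edit
distance. The function names, the CONFUSABLE table and the 0.5 cost are
all assumptions made up for the example; a real proposal would derive
the costs from data. Letters assumed to be easily confused get a
cheaper substitution, so likely typos rank as closer matches.

```python
def weighted_edit_distance(source, target, sub_cost=None, indel_cost=1.0):
    """Levenshtein distance where each substitution can have its own cost."""
    if sub_cost is None:
        sub_cost = lambda a, b: 1.0  # plain (unweighted) edit distance
    # prev[j] is the cost of turning the source prefix seen so far into target[:j].
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        curr = [i * indel_cost]
        for j, t_ch in enumerate(target, start=1):
            step = 0.0 if s_ch == t_ch else sub_cost(s_ch, t_ch)
            curr.append(min(prev[j - 1] + step,        # substitute (or match)
                            prev[j] + indel_cost,      # delete from source
                            curr[j - 1] + indel_cost)) # insert into source
        prev = curr
    return prev[-1]


# Hypothetical table of letter pairs assumed to be commonly confused.
CONFUSABLE = {frozenset("ae"), frozenset("ei"), frozenset("mn")}


def confusion_cost(a, b):
    """Cheaper substitution for pairs listed in CONFUSABLE (illustrative values)."""
    return 0.5 if frozenset((a, b)) in CONFUSABLE else 1.0


print(weighted_edit_distance("seperate", "separate", confusion_cost))  # 0.5
print(weighted_edit_distance("seperate", "sepzrate", confusion_cost))  # 1.0
```

The inner loop does the same amount of work as the unweighted version,
so in this form the extra search-time cost is only the lookup in the
cost table.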

Other distance functions that may be worth considering include ones
tailored to T9 (which is still used in some places, although possibly
not enough to justify implementing something specific) or predictive
text. Both could perhaps improve spelling correction for mobile /
small device input.
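
One simple way to think about a T9-tailored comparison is to compare
the keys pressed rather than the letters themselves. The sketch below
is plain Python and only illustrative: the keypad layout is the
standard phone one, but the helper names (t9_sequence, t9_candidates)
and the tiny vocabulary are assumptions for the example, not anything
in Xapian.

```python
# Standard phone keypad letter groups.
T9_KEYS = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {letter: key
                 for key, letters in T9_KEYS.items()
                 for letter in letters}


def t9_sequence(word):
    """Return the digit keys pressed to enter `word` (non-letters are dropped)."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower() if ch in LETTER_TO_KEY)


def t9_candidates(typed, vocabulary):
    """Words from `vocabulary` that share the key presses of `typed`."""
    target = t9_sequence(typed)
    return [w for w in vocabulary if w != typed and t9_sequence(w) == target]


print(t9_sequence("home"))                                      # 4663
print(t9_candidates("home", ["good", "gone", "book", "home"]))  # ['good', 'gone']
```

Words with identical key sequences are at distance zero under this
view, so a spelling corrector aimed at T9 input could offer them as
alternatives even though their letter-level edit distance is large.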

> 2) How does Xapian expand a query?

If you're talking about how spelling corrections are expanded, then
most of the information you need should be in the getting started
guide
<https://getting-started-with-xapian.readthedocs.org/en/latest/howtos/spelling.html>.
There's a discussion of the algorithm, and limitations, on that page.

> I also want to know how it finds all the synonyms of a
> particular word in a query.

Xapian has no direct support for finding synonyms; it uses a synonym
dictionary which you (as a user of Xapian) have to build up. There's
some (but admittedly not much!) information on this in the getting
started guide
<https://getting-started-with-xapian.readthedocs.org/en/latest/howtos/synonyms.html>,
which should help. Unfortunately there's no example code, although the
Python tests in xapian-bindings do show how to use the API.
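
To make the "user-built synonym dictionary" idea concrete, here is a
small plain-Python sketch. It deliberately does not use the Xapian API:
the SynonymDictionary class and its methods are hypothetical and only
show the concept. In Xapian itself, as far as I know, the dictionary is
filled with WritableDatabase::add_synonym() and consulted by the
QueryParser when flags such as FLAG_SYNONYM or FLAG_AUTO_SYNONYMS are
set.

```python
from collections import defaultdict


class SynonymDictionary:
    """Hypothetical stand-in for a synonym dictionary built by the user."""

    def __init__(self):
        self._synonyms = defaultdict(set)

    def add_synonym(self, term, synonym):
        """Record `synonym` as an alternative for `term` (one direction only)."""
        self._synonyms[term].add(synonym)

    def expand(self, query):
        """Return one list of alternatives per query term, original term first."""
        groups = []
        for term in query.lower().split():
            groups.append([term] + sorted(self._synonyms.get(term, ())))
        return groups


synonyms = SynonymDictionary()
synonyms.add_synonym("happy", "glad")
synonyms.add_synonym("happy", "cheerful")

print(synonyms.expand("happy dog"))  # [['happy', 'cheerful', 'glad'], ['dog']]
print(synonyms.expand("glad"))       # [['glad']]
```

Nothing is inferred automatically here: "glad" does not expand to
"happy" unless that mapping is added too, which mirrors the point that
the dictionary has to be built up by whoever is using it.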

J

-- 
  James Aylett, occasional trouble-maker
  xapian.org


