GSoC 2016 "Project : Weighting Scheme" Intro

Nishad Dawkhar nishad.dawkhar94 at gmail.com
Sat Mar 12 14:55:58 GMT 2016


Hi, I have some questions about implementing weight normalization schemes
(e.g. cosine normalization) in TfIdf. To implement these, the weights of
all the terms in a document are needed. If the score of each document is
available separately, it can be normalized by dividing that score by the
square root of the sum of the squared TfIdf weights of the individual terms
in that document.
Reference:
http://www.ics.uci.edu/~djp3/classes/2008_09_26_CS221/Lectures/Lecture26.pdf
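
To make the computation concrete, here is a minimal standalone sketch of
what I have in mind. It does not use any Xapian API: cosine_norm() and
normalize_score() are hypothetical helper names, and the vector of
per-term weights is assumed to be available somehow, which is exactly the
part I am unsure about.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not part of Xapian): given the unnormalized TfIdf
// weight of every term in one document, return the Euclidean length
// sqrt(sum of w_t^2), which is the cosine normalization factor.
double cosine_norm(const std::vector<double>& term_weights) {
    double sum_sq = 0.0;
    for (double w : term_weights) {
        sum_sq += w * w;
    }
    return std::sqrt(sum_sq);
}

// Divide a document score by its normalization factor.
// A zero factor (no weighted terms) leaves the score unchanged, so we
// never divide by zero.
double normalize_score(double score, double norm) {
    return norm > 0.0 ? score / norm : score;
}
```

For example, a document whose terms have weights 3 and 4 has a factor of
5, so a raw score of 10 would become 2.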

I searched the Xapian code for such a list of scores of the documents that
contain query terms, but couldn't find one. I didn't completely understand
the working of MultiMatch::get_mset(), which produces the list of relevant
items. It would be great if someone could explain in some detail how this
method works, and how the scores of individual documents can be retrieved
so that a normalization can be computed for each of them. I have read
http://xapian.org/docs/matcherdesign.html , but I did not understand the
exact functioning. The details of the matching process will be needed to
implement this normalization.

In the current TfIdf weighting scheme's get_sumpart() method, a method
called get_wtn() is invoked before the final weight is returned. The
proposed normalization is supposed to be implemented there. But I don't
think it is possible to calculate the normalized weight at that point,
because we need the weights contributed by every term in the document being
scored, and it would probably be costly to calculate each term's weight
inside this method. Hence, as I've mentioned above, it would be a good idea
to carry out this normalization after the initial document scores have been
calculated.
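
As a rough illustration of such a post-pass, assuming the matcher could
hand back (docid, score) pairs and that a normalization factor per
document had already been computed, something like the following is what
I am picturing. ScoredDoc and normalize_and_rerank() are hypothetical
names, not Xapian types or functions.

```cpp
#include <algorithm>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one matched document; not a Xapian type.
struct ScoredDoc {
    unsigned docid;
    double score;
};

// Divide each score by a precomputed per-document factor, then re-sort in
// descending score order, since normalization can change the relative
// order of documents. Documents with a missing or zero factor keep their
// original score.
void normalize_and_rerank(std::vector<ScoredDoc>& docs,
                          const std::unordered_map<unsigned, double>& norms) {
    for (ScoredDoc& d : docs) {
        auto it = norms.find(d.docid);
        if (it != norms.end() && it->second > 0.0) {
            d.score /= it->second;
        }
    }
    std::stable_sort(docs.begin(), docs.end(),
                     [](const ScoredDoc& a, const ScoredDoc& b) {
                         return a.score > b.score;
                     });
}
```

The re-sort is the reason I think this has to happen once the initial
scores are known: a document ranked first on its raw score can drop below
another one after both scores are divided by their own factors.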