[Xapian-discuss] add_posting(): term position significance - line or offset?
Richard Boulton
richard at lemurconsulting.com
Tue Nov 18 16:38:17 GMT 2008
Henry wrote:
> Greets,
>
> WRT add_posting() and the term's position: presumably it's best to
> use the actual offset in the source as the position, rather than the
> line number containing the term, right?
The usual approach is to store the "word number" at which a word appears,
and this is probably what you want. However, you could store the line
number if you wanted: phrase searches (with a window equal to the phrase
length) would then match even when the words were fairly spread out (i.e.,
up to one per line).
I recommend using word number, anyway, unless you have a very odd
situation I've not thought of.
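
To make the difference concrete, here is a minimal sketch in plain Python. It does not use Xapian itself: word_postings, line_postings and phrase_matches are hypothetical helpers, and phrase_matches is only a simplified model of an ordered phrase match within a window, not Xapian's actual matcher. Each (term, position) pair is the kind of value you would pass to add_posting().

```python
from bisect import bisect_right


def word_postings(text):
    """Yield (term, position) pairs, position = 1-based word number."""
    for pos, word in enumerate(text.lower().split(), start=1):
        yield word, pos


def line_postings(text):
    """Yield (term, position) pairs, position = 1-based line number."""
    for line_no, line in enumerate(text.lower().splitlines(), start=1):
        for word in line.split():
            yield word, line_no


def phrase_matches(postings, terms, window=None):
    """Simplified model (an assumption, not Xapian's code): the terms must
    occur in order at strictly increasing positions, all inside a span of
    `window` positions. The window defaults to the phrase length."""
    if not terms:
        return False
    window = window or len(terms)
    index = {}
    for term, pos in postings:
        index.setdefault(term, []).append(pos)
    for plist in index.values():
        plist.sort()
    for start in index.get(terms[0], []):
        prev = start
        ok = True
        for term in terms[1:]:
            plist = index.get(term, [])
            i = bisect_right(plist, prev)  # first position after prev
            if i == len(plist) or plist[i] - start >= window:
                ok = False
                break
            prev = plist[i]
        if ok:
            return True
    return False


text = "the quick brown\nfox jumps over\nthe lazy dog"

# Word numbers: only genuinely adjacent words form a phrase.
print(phrase_matches(word_postings(text), ["brown", "fox"]))  # True
print(phrase_matches(word_postings(text), ["quick", "fox"]))  # False

# Line numbers: words on consecutive lines look "adjacent".
print(phrase_matches(line_postings(text), ["quick", "fox"]))  # True
```

With line numbers as positions, "quick fox" matches even though the words are not adjacent in the text, which is the looser behaviour described above.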
> I take it this may result in more accurate phrase searching, and
> better general search results since term items' proximity would
> increase their score.
Note that Xapian currently doesn't modify the weight of a phrase based
on how close together the terms are. A phrase search either matches the
phrase, in which case the weight is the sum of the weights of the
constituent terms, or it doesn't, in which case the phrase contributes
no weight and the document won't be returned unless other parts of the
query match it. This is something that could be improved, but we haven't
had the time (or motivation) to fix it yet...
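
As a toy illustration of that all-or-nothing behaviour (plain Python, not Xapian code; phrase_weight is a hypothetical helper and the term weights are made-up numbers):

```python
def phrase_weight(term_weights, positions, window):
    """Toy model: return the sum of the term weights if the phrase matches,
    otherwise 0.0. `positions` holds one position per phrase term, in phrase
    order, for a single candidate occurrence in a document."""
    in_order = all(a < b for a, b in zip(positions, positions[1:]))
    matched = in_order and positions[-1] - positions[0] < window
    return sum(term_weights) if matched else 0.0


weights = [1.5, 2.0]  # made-up per-term weights

adjacent = phrase_weight(weights, [10, 11], window=5)
spread = phrase_weight(weights, [10, 14], window=5)
outside = phrase_weight(weights, [10, 15], window=5)

print(adjacent, spread, outside)  # 3.5 3.5 0.0
```

The adjacent and spread-out occurrences get the same weight; how close the terms are only decides whether the phrase matches at all.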
--
Richard