[Xapian-discuss] About field weight

Olly Betts olly at survex.com
Mon Apr 3 23:37:15 BST 2006


On Sun, Apr 02, 2006 at 05:46:05PM +0200, David Levy wrote:
> - I am searching for the terms "ipod vidéo 60" (with the OR operator)
> - the first results sorted by relevance are :  (name / description)
> 
>    1. Etui en cuir Shinnorie EZgoing pour iPod avec vidéo 60 Go - Blanc
> 
>[...]
> 
>    5. L'iPod vidéo 60 Go
>[...]
> - I don't understand howcome the 6th results is not in first position.

I assume you mean 5th not 6th, because you didn't show us the 6th
result...

> Indeed, here is a part of my scriptindex configuration :
> name: weight=5 unhtml index index=S field=caption
> description: weight=1 unhtml index field=sample
> 
> that means *name* should be considered 5 times more relevant than
> *description*, isn't it ?

"5 times more relevant" seems a little misleading.  Specifying
"weight=5" means precisely that an appearance of a term in "name" is
equivalent to 5 appearances of that term in "description".
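
As a rough illustration of that, here is a small Python sketch of how
per-field weights could feed into a term's wdf.  It is not scriptindex's
actual code: the function name, the whitespace tokenisation and the sample
text are all assumptions made for the example.

```python
from collections import Counter

def wdf_counts(fields, weights):
    """Return a term -> wdf mapping for one document.

    Each occurrence of a term adds the weight of the field it occurs in,
    so one occurrence in a weight=5 field counts the same as five
    occurrences in a weight=1 field.
    """
    wdf = Counter()
    for field, text in fields.items():
        # Naive lowercase/whitespace tokenisation, for illustration only.
        for term in text.lower().split():
            wdf[term] += weights[field]
    return wdf

# Hypothetical document using the weights from the configuration above.
weights = {"name": 5, "description": 1}
doc = {"name": "ipod video 60", "description": "ipod case for the ipod"}

wdf = wdf_counts(doc, weights)
ipod_wdf = wdf["ipod"]    # 5 from "name" plus 2 from "description"
video_wdf = wdf["video"]  # 5 from "name" only
```

So a larger weight on "name" scales up what each occurrence there
contributes, but it does not by itself distinguish two documents which
both have the term once in "name".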

But note that hits 1 and 5 both have all 3 search terms appearing once
in the "name" field.

The overall weight takes the document length into account too (the
document length is the sum of the wdfs of all terms which index the
document).  I assume the 1st match ranks higher because, relative to its
document length, it has more occurrences of the search terms in
"description" than the 5th does.
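
To show the effect of document length, here is a sketch of the term
frequency part of a textbook BM25 formula.  This is an assumption for the
purposes of illustration: it is not Xapian's exact formula, the k1 and b
values are just common textbook choices, and the document lengths are
made up.

```python
def bm25_tf(wdf, doc_len, avg_len, k1=1.2, b=0.75):
    """Term frequency component of a textbook BM25 weight.

    wdf     -- within-document frequency of the term
    doc_len -- document length (sum of the wdfs of all terms in the document)
    avg_len -- average document length over the collection
    """
    norm_len = doc_len / avg_len
    return wdf * (k1 + 1) / (wdf + k1 * (1 - b + b * norm_len))

# Same wdf, different document lengths: the shorter document scores higher.
short_doc = bm25_tf(wdf=5, doc_len=40, avg_len=80)
long_doc = bm25_tf(wdf=5, doc_len=160, avg_len=80)

# Same document length, extra occurrences in "description" raise the wdf
# and so raise the score.
more_hits = bm25_tf(wdf=7, doc_len=80, avg_len=80)
fewer_hits = bm25_tf(wdf=5, doc_len=80, avg_len=80)
```

Under these assumptions, two documents with identical "name" matches can
still rank differently because of their lengths and of how often the
terms occur in "description".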

Xapian doesn't allow you to consider the lengths of individual fields
separately (at least not at present).

> Maybe I should have put higher weight to the *name* field ? like 20 or 100
> instead of 5 ?

Well, both "name" fields will still have all 3 search terms appearing
once.  There would be some effect from a long "name" field inflating the
document length.

I'd be concerned that using a large weight might decrease the quality of
retrieval results in general though.

Actually, what would give you the result you're after is an idea I've had
in the back of my mind for a while, but haven't had a chance to try out.
For want of a better name, I'll call it "auto phrase search".

With this, a query for `ipod video 60' would be roughly equivalent to:

    "ipod video 60" OR (ipod video 60)

So if the phrase matches in a particular document, that document gets a
higher weight than if it doesn't.  If the phrase doesn't match any
documents, the results will be the same as for the query as currently
handled.
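
Until something like that exists, the rewrite could be approximated on
the application side before the query string is passed to the query
parser.  A minimal sketch in Python follows; the function name is
hypothetical, and it assumes the parser's default operator is OR (as in
your setup) and that the user's input contains no operators or quotes of
its own.

```python
def auto_phrase_query(user_query):
    """Rewrite 'a b c' as '"a b c" OR (a b c)'.

    A single-term (or empty) query is returned unchanged, since a phrase
    of one term adds nothing.
    """
    terms = user_query.split()
    if len(terms) < 2:
        return user_query
    joined = " ".join(terms)
    return '"{0}" OR ({0})'.format(joined)

rewritten = auto_phrase_query("ipod video 60")
unchanged = auto_phrase_query("ipod")
```

Documents matching the phrase then pick up weight from both branches of
the OR, while documents matching only individual terms are weighted by
the second branch alone.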

Cheers,
    Olly


