[Xapian-discuss] Question: Query weights, Rset usage, lowercase

Olly Betts olly at survex.com
Sat Dec 9 03:54:03 GMT 2006


On Sat, Dec 09, 2006 at 09:55:11AM +0800, Andrey Kong wrote:
> 1) How much does it cost if I put the Descriptions inside the
> Xapian.document.data field? (Assume the Descriptions are the
> HTML-stripped contents of web pages.) Will the Xapian DB become very
> big and affect performance? (I have 1M docs when testing.)

Assuming the usual pattern of searching for 10 or so matches, this
shouldn't be a problem at all.

The document data is stored in a separate file, so there should be
no effect on matching, aside from competing for disk cache.  You'll have
similar competition for disk cache anyway if you're pulling the same
data from an SQL database hosted on the same machine.
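
A rough sketch of that usual pattern, assuming the Python bindings.  The
helper names store_description and top_descriptions are hypothetical, and
the objects passed in are only assumed to behave like xapian.Document and
xapian.Enquire.  The description is stored with set_data() at index time
and read back only for the few matches actually shown:

```python
def store_description(doc, description):
    """Attach the plain-text description to a xapian.Document-like object."""
    doc.set_data(description)


def top_descriptions(enquire, count=10):
    """Return the stored data for the first `count` matches only.

    `enquire` is assumed to be a xapian.Enquire whose query is already set.
    """
    mset = enquire.get_mset(0, count)
    # Document data is fetched per displayed match, not during matching.
    return [match.document.get_data() for match in mset]
```

With the Python 3 bindings get_data() should return bytes rather than str,
so you would probably want to decode it before display.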

> 2) Since I am now able to search the Title (prefix PT, weight=20) and
> Descriptions (no prefix, weight=1) of the database, I have begun
> wondering how to assign different weights within the Query. How do I
> achieve:
> 
> Query using "OR" (Microsoft, Keyboard, Mouse)
> 
> in which the term "Microsoft" = weight 5 | "Keyboard" = weight 1 | "Mouse" = weight 1

Just set the within-query frequency (wqf) when you construct the Query
for that term - e.g. Query("microsoft", 5).
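
A minimal sketch of the whole OR query, assuming the Python bindings.  The
helper names and EXAMPLE_TERMS are hypothetical; the weights are simply
the ones from the question.  How strongly wqf affects the final score
depends on the weighting scheme in use, so a wqf of 5 need not mean "five
times as important".

```python
def normalise_weighted_terms(weighted_terms):
    """Lowercase each term and check that each wqf is a positive integer."""
    result = []
    for term, wqf in weighted_terms:
        if not isinstance(wqf, int) or wqf < 1:
            raise ValueError("wqf must be a positive integer: %r" % (wqf,))
        result.append((term.lower(), wqf))
    return result


def build_or_query(weighted_terms):
    """Combine the terms with OR, passing each wqf to the Query constructor."""
    import xapian  # imported here so the helper above works without the bindings

    subqueries = [xapian.Query(term, wqf)
                  for term, wqf in normalise_weighted_terms(weighted_terms)]
    return xapian.Query(xapian.Query.OP_OR, subqueries)


# The weights from the question (the asker's numbers, not a recommendation).
EXAMPLE_TERMS = [("Microsoft", 5), ("Keyboard", 1), ("Mouse", 1)]
```

build_or_query(EXAMPLE_TERMS) would then give a Query to pass to
Enquire.set_query().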

> Because it's normal that people will type in the most important terms
> first and the less important terms later, I want to build the
> query the same way.

I have my doubts about this idea.  The risk is that you'll improve
results for some queries while making others worse.

I think people tend to enter queries with the natural word order.
Sometimes the more important terms will be first, and sometimes they
won't.

In this case, "Microsoft" is performing an adjectival role by defining
a narrower scope for the words which follow, which is why it's perhaps
more important.  But this varies between languages - in Spanish it would
probably be "mouse de Microsoft" not "Microsoft mouse".

> 3) Since I add my own prefixes manually, I wonder whether Xapian
> changes all terms to lowercase automatically, or do I need to do it
> manually?

Xapian treats terms as opaque pieces of data, so you'll need to
lowercase them yourself if that's what you want.  Otherwise it wouldn't
be possible to implement a case-sensitive search.
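
A small sketch of doing the lowercasing yourself at index time.  The
helpers make_term and index_words are hypothetical, `doc` is assumed to
behave like xapian.Document, and treating the "weight=20" from the
question as a wdf increment is also an assumption.  The same
normalisation has to be applied to query terms, or they won't match.

```python
def make_term(word, prefix=""):
    """Build an index term: the prefix as given, then the lowercased word."""
    return prefix + word.lower()


def index_words(doc, words, prefix="", wdf_inc=1):
    """Add each word to `doc` as a lowercased, optionally prefixed term."""
    for word in words:
        # Xapian stores exactly the string it is given, case included.
        doc.add_term(make_term(word, prefix), wdf_inc)
```

For example, index_words(doc, ["Microsoft", "Mouse"], "PT", 20) would add
the terms "PTmicrosoft" and "PTmouse".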

> 4) When I query ("search engine"), if I add 3 docs to the RSet, does
> this "RSet related to -search engine-" remain in the database, so that
> next time I run the same query "search engine", the 3 docs in the
> RSet can be retrieved from the database? How do I do that?

The RSet isn't stored in the database.  The RSet represents a set of
relevance judgements which a user has made pertaining to a particular
query (or more generally to a particular "information need").  If you
want to store it, it almost certainly needs to be per user and probably
per query too.  In a web application, I'd suggest storing it in a cookie.
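
One possible shape for the cookie approach, as a sketch only.  The cookie
naming scheme, the "."-separated value format, and all helper names here
are assumptions for illustration.  The cookie holds the document ids the
user marked as relevant, and the RSet is rebuilt from them on the next
request.

```python
import hashlib


def cookie_name(query_string):
    """Derive a per-query cookie name from the normalised query text."""
    normalised = " ".join(query_string.lower().split())
    digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
    return "rset_" + digest[:16]


def encode_docids(docids):
    """Serialise document ids as a '.'-separated cookie value."""
    return ".".join(str(d) for d in sorted(set(docids)))


def decode_docids(value):
    """Parse a cookie value back into document ids, skipping invalid parts."""
    docids = []
    for piece in value.split("."):
        piece = piece.strip()
        if piece.isascii() and piece.isdigit() and int(piece) > 0:
            docids.append(int(piece))
    return docids


def rset_from_docids(docids):
    """Rebuild an RSet from stored document ids."""
    import xapian  # imported lazily; only this step needs the bindings

    rset = xapian.RSet()
    for docid in docids:
        rset.add_document(docid)
    return rset
```

The rebuilt RSet can then be passed to Enquire.get_mset() as its RSet
argument.  Since the cookie comes from the client, the decoded ids should
be treated as untrusted input.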

> I think it would be even better if there were 2-5 lines of example
> usage in the API documentation.

Yes, that would be good (though I think many would need a larger
example to be useful).  However, it would be a substantial amount of
work and we're all already very busy.  Patches are welcome of course
(if anyone wants to work on this, please add examples to the doxygen
comments in the headers, not the HTML documentation which is
automatically generated from them!)

> If every function had 3-5 lines of example code, we could understand
> the function and its usage in 5 seconds. Without the examples, I spent
> 3-5 hours testing it out myself, and some people just give up...

I'd suggest you simply search the examples (or failing that, Omega) for
the particular method you want to see in context.  For most methods,
that will find you an actual working piece of code using the method.

Cheers,
    Olly


