[Xapian-discuss] A few questions wrt Xapian

Wed Nov 5 20:19:50 GMT 2008

Henka wrote:
> 1.  Is Xapian similar to Lucene in the sense that you can define as  
> many fields as you want, and assign various weights (which influence  
> search result sorting) to these fields?  I gather from the docs that  
> you can, but I just need confirmation.

Yes, you can do this.

> 2.  Let's say you're indexing websites; can you then merge/combine  
> many smaller indexes into larger ones for later searching?

Yes (use the xapian-compact tool to do this).  You can also search
across several indexes without merging them together first - the results
of searches performed this way are essentially identical to those across
a merged index.

> Searching
> 
> 1.  I gather from the docs that you can sort results according to your  
> own field/s, followed by the default document scoring (think  
> "page-rank").  Correct?

Yes, you can store arbitrary extra info with each document and sort by it.
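
As a rough illustration of the ordering being asked about (a plain-Python
sketch of the idea, not Xapian's API; the Hit record and its page_rank
field are made up for the example): results are ordered by a stored
per-document value first, and the relevance score only breaks ties.  In
Xapian itself I believe the corresponding call is
Enquire::set_sort_by_value_then_relevance(), but check the API docs for
your version.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    docid: int
    page_rank: int  # hypothetical stored value, higher is better
    score: float    # relevance weight from the search engine


def sort_by_value_then_relevance(hits):
    """Order by the stored value (descending), then by relevance (descending)."""
    return sorted(hits, key=lambda h: (-h.page_rank, -h.score))


hits = [Hit(1, 3, 0.9), Hit(2, 5, 0.2), Hit(3, 5, 0.7), Hit(4, 3, 0.95)]
ordered = sort_by_value_then_relevance(hits)
print([h.docid for h in ordered])  # [3, 2, 4, 1]
```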

> 2.  ~/docs/remote.htm mentions distributed searching - we want to  
> spread the search load around our cluster by splitting the index into  
> many manageable-sized indexes (to ensure sub-second performance), with  
> a "master" node which combines search results and end-users see.  Is  
> my understanding correct and are there any pitfalls/bottlenecks?

Yes, that exists.  There are probably several pitfalls/bottlenecks, but 
I can't think of any particularly significant ones, and it is quite 
usable anyway.
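
To make the "master node" idea concrete, here is a minimal plain-Python
sketch of combining per-shard result lists (this is not Xapian's remote
backend, which does the combining for you; the function name and the
(score, docid) tuples are assumptions for the example).  It assumes each
shard returns its hits already sorted by descending score, and that
scores from different shards are directly comparable.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, limit):
    """Merge per-shard hit lists into one top-`limit` list.

    Each shard list holds (score, docid) tuples sorted by descending score.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.8, "b1"), (0.5, "b2"), (0.1, "b3")]
top = merge_shard_results([shard_a, shard_b], limit=3)
print(top)  # [(0.9, 'a1'), (0.8, 'b1'), (0.5, 'b2')]
```

Because the merge is lazy, the master only pulls as many hits from each
shard list as it needs to fill the requested page.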

> 3.  Removing duplicates:  this can be done programmatically I know  
> (but is slow on our chosen platform - Perl), but does Xapian provide  
> this mechanism built-in?  For example:  a search result might return  
> several pages from a web site, but we want to remove these dups and  
> only provide a single result (highest ranking) per website (eg, with a  
> link for "More from this site..." - a la Google, which will be a  
> separate search displaying all the site-duplicates).

Yes - this is called "collapsing" in Xapian.
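
For anyone unfamiliar with the term, the behaviour can be sketched in
plain Python as follows (an illustration of the concept only, not
Xapian's implementation; the function and parameter names are made up).
Hits arrive in rank order, only the best hit per collapse key (here the
site) is kept, and the number of hidden hits is recorded so a "More from
this site..." link can be shown.  In Xapian the feature is enabled with
Enquire::set_collapse_key() on a value slot holding the key.

```python
def collapse(hits, key, collapse_max=1):
    """Keep the first `collapse_max` hits per key from a rank-ordered list.

    Returns (kept_hits, hidden_counts), where hidden_counts maps each key
    to the number of lower-ranked hits that were dropped.
    """
    kept = []
    seen = {}
    for hit in hits:
        k = key(hit)
        seen[k] = seen.get(k, 0) + 1
        if seen[k] <= collapse_max:
            kept.append(hit)
    hidden = {k: n - collapse_max for k, n in seen.items() if n > collapse_max}
    return kept, hidden


# (site, path) pairs, best-ranked first
ranked = [
    ("example.com", "/a"),
    ("example.org", "/x"),
    ("example.com", "/b"),
    ("example.com", "/c"),
]
kept, hidden = collapse(ranked, key=lambda hit: hit[0])
print(kept)    # [('example.com', '/a'), ('example.org', '/x')]
print(hidden)  # {'example.com': 2}
```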

> 4.  If the mechanism to remove duplicates exists, will this still work  
> cluster-wide in distributed searching?

Yes.

> 5.  Does Xapian provide a mechanism for identifying the actual field  
> in a search result which triggered the hit?  eg, let's say you have  
> TITLE, BODY, OTHER as fields in your index.  If a search found your  
> term in the BODY field, does Xapian provide this as feedback?

Yes: you can identify which query terms matched a given document, and 
from those terms determine which fields they belong to.
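
A minimal sketch of that last step, assuming each field's terms were
indexed with a distinguishing prefix (a common Xapian convention) and
that you already have the list of matching terms for a hit.  The prefix
table below ("S" for TITLE, "XO" for OTHER, no prefix for BODY) and the
function name are assumptions for the example, not something fixed by
Xapian.

```python
# Hypothetical prefix-to-field table; unprefixed terms are treated as BODY.
PREFIX_TO_FIELD = {"S": "TITLE", "XO": "OTHER"}


def fields_for_terms(matching_terms, prefix_to_field=PREFIX_TO_FIELD,
                     default_field="BODY"):
    """Return the set of field names that the matching terms belong to.

    Assumes prefixes are upper case and the indexed words themselves are
    lower case, so an unprefixed word cannot be mistaken for a prefix.
    """
    # Try longer prefixes first so "XO" is not shadowed by a shorter one.
    prefixes = sorted(prefix_to_field, key=len, reverse=True)
    fields = set()
    for term in matching_terms:
        for prefix in prefixes:
            if term.startswith(prefix):
                fields.add(prefix_to_field[prefix])
                break
        else:
            fields.add(default_field)
    return fields


print(sorted(fields_for_terms(["Sxapian", "search"])))  # ['BODY', 'TITLE']
```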

> 6.  This is difficult I know:  how does Xapian compare  
> performance-wise?  Has anyone done any basic benchmarking?

I have no useful figures to hand.  Please share any you create. ;-)

-- 
Richard