[Xapian-discuss] A few questions wrt Xapian

Henka henka at cityweb.co.za
Mon Nov 3 10:09:54 GMT 2008


Greetings all,

I'm about to evaluate Xapian for a future project and would appreciate  
a few comments from those in the know:

Indexing

1.  Is Xapian similar to Lucene in the sense that you can define as  
many fields as you want, and assign various weights (which influence  
search result sorting) to these fields?  I gather from the docs that  
you can, but I just need confirmation.

2.  Let's say you're indexing websites; can you then merge/combine  
many smaller indexes into larger ones for later searching?


Searching

1.  I gather from the docs that you can sort results according to your  
own field/s, followed by the default document scoring (think  
"page-rank").  Correct?

2.  ~/docs/remote.htm mentions distributed searching - we want to  
spread the search load around our cluster by splitting the index into  
many manageable-sized indexes (to ensure sub-second performance), with  
a "master" node which combines the search results that end-users see.  Is  
my understanding correct and are there any pitfalls/bottlenecks?

3.  Removing duplicates:  this can be done programmatically I know  
(but is slow on our chosen platform - Perl), but does Xapian provide  
this mechanism built-in?  For example:  a search result might return  
several pages from a web site, but we want to remove these dups and  
only provide a single result (highest ranking) per website (eg, with a  
link for "More from this site..." - à la Google, which will be a  
separate search displaying all the site-duplicates).

4.  If the mechanism to remove duplicates exists, will this still work  
cluster-wide in distributed searching?

5.  Does Xapian provide a mechanism for identifying the actual field  
in a search result which triggered the hit?  eg, let's say you have  
TITLE, BODY, OTHER as fields in your index.  If a search found your  
term in the BODY field, does Xapian provide this as feedback?

6.  This is difficult I know:  how does Xapian compare  
performance-wise?  Has anyone done any basic benchmarking?
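
To make searching questions 2 to 4 concrete, here is a rough sketch of
what the "master" node would otherwise have to do by hand: merge the
ranked hit lists coming back from each index, then keep only the
highest-ranking hit per website.  It is written in Python for brevity
(our platform is Perl), and every name in it (Hit, merge_shards,
collapse_by_site) is hypothetical.  None of this is Xapian API; it only
illustrates the behaviour we are hoping is built in.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One search hit; the field names are illustrative assumptions."""
    score: float
    site: str
    url: str


def merge_shards(*shards: Iterable[Hit]) -> Iterable[Hit]:
    """Merge per-index hit lists into one stream, best score first.

    Assumes each input list is already sorted by descending score.
    """
    return heapq.merge(*shards, key=lambda hit: hit.score, reverse=True)


def collapse_by_site(hits: Iterable[Hit]) -> List[Hit]:
    """Keep only the first (highest-ranking) hit seen for each site."""
    seen = set()
    kept = []
    for hit in hits:
        if hit.site not in seen:
            seen.add(hit.site)
            kept.append(hit)
    return kept


if __name__ == "__main__":
    shard_a = [Hit(0.9, "a.example", "http://a.example/1"),
               Hit(0.4, "b.example", "http://b.example/1")]
    shard_b = [Hit(0.7, "a.example", "http://a.example/2"),
               Hit(0.5, "c.example", "http://c.example/1")]
    for hit in collapse_by_site(merge_shards(shard_a, shard_b)):
        print(hit.score, hit.url)
```

Note that the collapsing has to happen after the merge: if each index
removed its own duplicates first, the same site could still appear once
per index in the combined list.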



Thanks for any information you can provide.

Regards
Henry


