[Xapian-discuss] A few questions wrt Xapian
Richard Boulton
richard at lemurconsulting.com
Wed Nov 5 20:19:50 GMT 2008
Henka wrote:
> 1. Is Xapian similar to Lucene in the sense that you can define as
> many fields as you want, and assign various weights (which influence
> search result sorting) to these fields? I gather from the docs that
> you can, but I just need confirmation.
Yes, you can do this.
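As a rough illustration of the idea (plain Python, not the Xapian API): one common approach is to give each field its own term prefix and to bump the within-document frequency by a per-field amount at index time, which is roughly what the wdf increment argument to Document::add_term() is for. The field names, prefixes and weights below are hypothetical.

```python
from collections import Counter

# Hypothetical field setup: (term prefix, weight added per occurrence).
FIELDS = {"title": ("S", 5), "body": ("", 1)}

def index_document(doc):
    """Return a term -> within-document-frequency map for one document."""
    wdf = Counter()
    for field, text in doc.items():
        prefix, weight = FIELDS[field]
        for word in text.lower().split():
            # A title occurrence counts for more than a body occurrence.
            wdf[prefix + word] += weight
    return wdf

terms = index_document({
    "title": "Xapian search",
    "body": "search with xapian search",
})
```

Here a word in the title contributes five times as much as the same word in the body, so title matches tend to rank higher.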
> 2. Let's say you're indexing websites; can you then merge/combine
> many smaller indexes into larger ones for later searching?
Yes (use the xapian-compact tool to do this). You can also search
across several indexes without merging them together first - the results
of searches performed this way are essentially identical to those across
a merged index.
> Searching
>
> 1. I gather from the docs that you can sort results according to your
> own field/s, followed by the default document scoring (think
> "page-rank"). Correct?
Yes, you can store arbitrary extra info with each document and sort by it.
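To make the ordering concrete, here is a small sketch in plain Python (not the Xapian API) of "sort by a stored value, then by relevance". In Xapian itself I believe Enquire::set_sort_by_value_then_relevance() is the relevant call. The "page_rank" and "relevance" keys and the sample hits are made up for illustration.

```python
# Illustrative only: each hit carries a stored per-document value
# (hypothetical "page_rank") plus the relevance score for the query.
hits = [
    {"url": "a.example/1", "page_rank": 3, "relevance": 0.40},
    {"url": "b.example/1", "page_rank": 7, "relevance": 0.20},
    {"url": "c.example/1", "page_rank": 7, "relevance": 0.90},
    {"url": "d.example/1", "page_rank": 3, "relevance": 0.75},
]

def sort_by_value_then_relevance(hits):
    # Higher page_rank first; ties are broken by higher relevance.
    return sorted(hits, key=lambda h: (-h["page_rank"], -h["relevance"]))

ordered = [h["url"] for h in sort_by_value_then_relevance(hits)]
```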
> 2. ~/docs/remote.htm mentions distributed searching - we want to
> spread the search load around our cluster by splitting the index into
> many manageable-sized indexes (to ensure sub-second performance), with
> a "master" node which combines search results and end-users see. Is
> my understanding correct and are there any pitfalls/bottlenecks?
Yes, distributed searching works as you describe. There are probably
several pitfalls/bottlenecks, but I can't think of any particularly
significant ones, and it is quite usable anyway.
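A minimal sketch of what the "master" node conceptually does, in plain Python rather than the Xapian remote backend: each shard returns its own hits sorted by descending score, and the master merges them and keeps the top few. This assumes scores from different shards are directly comparable; the shard contents and the function name are hypothetical.

```python
import heapq
from itertools import islice

def merge_shard_results(shard_results, limit):
    """Merge per-shard hit lists, each already sorted by descending score."""
    # heapq.merge expects ascending order, so merge on the negated score.
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[0])
    return list(islice(merged, limit))

shard_a = [(0.92, "a.example/x"), (0.41, "a.example/y")]
shard_b = [(0.88, "b.example/x"), (0.67, "b.example/y")]
shard_c = [(0.50, "c.example/x")]

top = merge_shard_results([shard_a, shard_b, shard_c], limit=3)
```

Because the merge is lazy, the master only looks at as many hits from each shard as it needs to fill the requested page.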
> 3. Removing duplicates: this can be done programmatically I know
> (but is slow on our chosen platform - Perl), but does Xapian provide
> this mechanism built-in? For example: a search result might return
> several pages from a web site, but we want to remove these dups and
> only provide a single result (highest ranking) per website (eg, with a
> link for "More from this site..." - a la Google, which will be a
> separate search displaying all the site-duplicates).
Yes - this is called "collapsing" in Xapian.
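For anyone unfamiliar with the term, this is roughly what collapsing means, sketched in plain Python (not the Xapian API, where I believe Enquire::set_collapse_key() and the per-hit collapse count do the work for you). The hit tuples and the choice of site name as the collapse key are assumptions for illustration.

```python
def collapse(hits, key):
    """Keep the first (highest-ranked) hit per key and count the rest."""
    kept = {}
    for hit in hits:  # hits are assumed to be in rank order already
        k = key(hit)
        if k in kept:
            kept[k]["collapsed"] += 1
        else:
            kept[k] = {"hit": hit, "collapsed": 0}
    # Dicts preserve insertion order, so rank order is retained.
    return list(kept.values())

ranked = [("siteA", "/p1"), ("siteB", "/x"), ("siteA", "/p2"), ("siteA", "/p3")]
collapsed = collapse(ranked, key=lambda hit: hit[0])
```

The "collapsed" count is what you would use to decide whether to show a "More from this site..." link.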
> 4. If the mechanism to remove duplicates exists, will this still work
> cluster-wide in distributed searching?
Yes.
> 5. Does Xapian provide a mechanism for identifying the actual field
> in a search result which triggered the hit? eg, let's say you have
> TITLE, BODY, OTHER as fields in your index. If a search found your
> term in the BODY field, does Xapian provide this as feedback?
You can identify the terms which matched a query, and hence determine
the fields they relate to, yes.
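A sketch of the second step, in plain Python: given the list of matching terms for a hit (in Xapian I believe Enquire::get_matching_terms_begin() gives you these), map each term's prefix back to a field. The prefix-to-field mapping below is hypothetical, and treating unprefixed terms as BODY is an assumption about how the index was built.

```python
# Hypothetical mapping from term prefix to field name.
PREFIX_TO_FIELD = {"S": "TITLE", "XOTHER": "OTHER"}

def field_of(term):
    # Try longer prefixes first so a long prefix is not shadowed by a short one.
    for prefix in sorted(PREFIX_TO_FIELD, key=len, reverse=True):
        if term.startswith(prefix):
            return PREFIX_TO_FIELD[prefix]
    return "BODY"  # assumption: unprefixed terms come from the body text

def fields_hit(matching_terms):
    """Return the sorted set of fields that the matching terms belong to."""
    return sorted({field_of(t) for t in matching_terms})

fields = fields_hit(["Sxapian", "xapian", "index"])
```

This relies on indexed words being lowercased, so that a leading capital can only be a prefix.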
> 6. This is difficult I know: how does Xapian compare
> performance-wise? Has anyone done any basic benchmarking?
I have no useful figures to hand. Please share any you create. ;-)
--
Richard