[Xapian-discuss] [ NUMBER OF SAMPLE ]
Eric B. Ridge
ebr at tcdi.com
Wed Jul 21 19:51:15 BST 2004
On Jul 21, 2004, at 12:59 PM, Boris Meyer wrote:
> I'm diving into the API, looking for some methods to retrieve this
It ain't there! :) The best you can do is get the "positional" data,
which I'm willing to bet is "word position" with Omega.
>> Right now one must re-parse the document, joining up with the terms
>> list from the result to find and highlight any/all hits, let alone
>> context extraction. A fairly expensive operation if you're going to
>> do this on a "summary display" of many documents.
> Yes, a very expensive process, especially given that the average size
> of the documents I would have to parse is 3 MB (PDF). And don't forget
> the x10 results/page please ;-).
PDF text extraction is a pain in the ass. I've got a handwritten PDF
parser (in Java) that does a decent job of text extraction (better than
xpdf in raw mode, in my opinion), but it's not perfect by any means.
And this is another gotcha. Even if Xapian did support tracking byte
offsets of terms, for what you want to do the offsets would need to be
offsets in the text version of the PDF, not the PDF itself. And where
is the text version of the PDF stored?
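To make the point concrete, here is a minimal sketch in Python. It assumes the PDF has already been converted to plain text by some extractor and that the extracted text is kept somewhere alongside the index. The helper names `term_byte_offsets` and `snippet` are made up for illustration; none of this is part of Xapian or Omega. The offsets are byte positions in the UTF-8 text version, which is exactly why that text has to be stored.

```python
import re

def term_byte_offsets(text):
    """Map each lowercased term to the UTF-8 byte offsets where it starts.

    `text` is the extracted plain text, NOT the raw PDF (assumption:
    extraction happened earlier and the result was stored).
    """
    offsets = {}
    byte_pos = 0   # running byte offset of the current match
    last = 0       # character index of the previous match start
    for m in re.finditer(r"\w+", text):
        # Advance the byte counter by the encoded length of the text
        # between the previous match start and this one.
        byte_pos += len(text[last:m.start()].encode("utf-8"))
        last = m.start()
        offsets.setdefault(m.group().lower(), []).append(byte_pos)
    return offsets

def snippet(text_bytes, offset, term_len, context=20):
    """Return a context snippet around one hit, with the hit wrapped in [ ].

    `offset` and `term_len` are in bytes; `context` is the number of bytes
    of surrounding text to include on each side.
    """
    start = max(0, offset - context)
    end = min(len(text_bytes), offset + term_len + context)
    # errors="ignore" drops a multi-byte character cut in half at the edges.
    before = text_bytes[start:offset].decode("utf-8", errors="ignore")
    hit = text_bytes[offset:offset + term_len].decode("utf-8")
    after = text_bytes[offset + term_len:end].decode("utf-8", errors="ignore")
    return f"{before}[{hit}]{after}"
```

With a stored offset table like this, a summary page only needs to seek into the stored text for each hit instead of re-parsing the whole document, at the cost of keeping both the text and the offsets on disk.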
> As hard disks are now cheap, and as everybody today expects a
> Google-style meaningful result listing with highlighted terms, I would
> also store such an index. But maybe there is another way?
I don't know what Omega will let you do, but using Xapian's API, when
you add a term to a document you can optionally give it positional
information. The intent I'm sure is for the position to be word
position. Xapian uses this for proximity searching. You could instead
use byte offsets, but your options for proximity go away. There's
little meaning in "documents where 'foo' and 'bar' are within XX bytes
of each other".
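A small sketch of why word positions suit proximity, written in plain Python rather than against Xapian itself. The helpers `word_positions` and `near` are hypothetical names for illustration only. In Xapian, the position computed here is the kind of value you would pass as the position argument of `Document::add_posting()`, assuming you index terms yourself rather than through Omega.

```python
import re

def word_positions(text):
    """Map each lowercased term to its 1-based word positions in `text`."""
    positions = {}
    for pos, m in enumerate(re.finditer(r"\w+", text), start=1):
        positions.setdefault(m.group().lower(), []).append(pos)
    return positions

def near(positions, a, b, window):
    """True if some occurrence of `a` lies within `window` words of `b`."""
    return any(
        abs(pa - pb) <= window
        for pa in positions.get(a, [])
        for pb in positions.get(b, [])
    )

pos = word_positions("foo is quite far away from the bar")
print(near(pos, "foo", "bar", 3))   # the two terms are 7 words apart
print(near(pos, "foo", "bar", 7))
```

If the stored positions were byte offsets instead, the same `window` would depend on how long the intervening words happen to be, which is why the proximity test stops being useful.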