[Xapian-discuss] [ NUMBER OF SAMPLE ]

Wed Jul 21 19:51:15 BST 2004

On Jul 21, 2004, at 12:59 PM, Boris Meyer wrote:

> I'm diving into the API, looking for some methods to retrieve this 
> offset.

It ain't there!  :)  The best you can do is get the "positional" data, 
which I'm willing to bet is "word position" with Omega.

>> Right now one must re-parse the document, joining up with the terms 
>> list from the result to find and highlight any/all hits, let alone 
>> context extraction.  A fairly expensive operation if you're going to 
>> do this on a "summary display" of many documents.
>
> Yes, a very time-consuming process, especially given the average size 
> of the documents I would have to parse: 3 MB (PDF), and don't forget 
> the 10 results per page, please ;-).
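
To make the cost concrete, here is a minimal sketch of the "re-parse and 
highlight" step described above. It is plain Python, not Xapian or Omega 
code. The function name `snippets`, the bracket markers and the `width` 
parameter are made up for illustration, and it assumes the query terms 
match the words in the text literally (no stemming).

```python
import re

def snippets(text, terms, width=30):
    """Re-scan `text` and return one highlighted context string per hit.

    `terms` are matched case-insensitively against whole words; stemming
    is deliberately ignored in this sketch.
    """
    wanted = {t.lower() for t in terms}
    out = []
    for m in re.finditer(r"\w+", text):
        if m.group().lower() in wanted:
            # Take up to `width` characters of context on each side.
            start = max(0, m.start() - width)
            end = min(len(text), m.end() + width)
            out.append(
                text[start:m.start()] + "[" + m.group() + "]" + text[m.end():end]
            )
    return out
```

Every document on the results page has to be scanned from start to 
finish this way, which is what makes it expensive for large documents.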

PDF text extraction is a pain in the ass.  I've got a handwritten PDF 
parser (in Java) that does a decent job of text extraction (better than 
xpdf in raw mode, in my opinion), but it's not perfect by any means.

And this is another gotcha.  Even if Xapian did support tracking byte 
offsets of terms, for what you want to do the offsets would need to be 
offsets in the text version of the PDF, not the PDF itself.  And where 
is the text version of the PDF stored?

> As hard disks are now cheap and everybody today expects a Google-like, 
> meaningful result listing with highlighted terms, I would also store 
> such an index. But maybe there is another way?

I don't know what Omega will let you do, but using Xapian's API, when 
you add a term to a document you can optionally give it positional 
information.  The intent I'm sure is for the position to be word 
position.  Xapian uses this for proximity searching.  You could instead 
use byte offsets, but your options for proximity go away.  There's 
little meaning in "documents where 'foo' and 'bar' are within XX bytes 
of each other".
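
A rough sketch of the trade-off, in plain Python rather than the Xapian 
API. The helper names `index_positions` and `within` are hypothetical. 
The sketch records both a word position and a UTF-8 byte offset for each 
term, then checks "closeness" on either one.

```python
import re
from collections import defaultdict

def index_positions(text):
    """Map each lowercased term to a list of (word_position, byte_offset)."""
    postings = defaultdict(list)
    for word_pos, m in enumerate(re.finditer(r"\w+", text), start=1):
        # Byte offset of the term in the UTF-8 encoding of the text.
        byte_off = len(text[:m.start()].encode("utf-8"))
        postings[m.group().lower()].append((word_pos, byte_off))
    return postings

def within(postings, a, b, window, field):
    """True if some occurrences of `a` and `b` are at most `window` apart.

    field=0 compares word positions, field=1 compares byte offsets.
    """
    return any(
        abs(pa[field] - pb[field]) <= window
        for pa in postings.get(a, [])
        for pb in postings.get(b, [])
    )
```

For the text "foo extraordinarily bar", 'foo' and 'bar' are two words 
apart but twenty bytes apart, so a window that makes sense in words says 
nothing useful in bytes: the byte distance depends on how long the words 
in between happen to be.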

eric