[Xapian-discuss] quick-and-dirty web search for a bunch of PDFs?
Tim Brody
tdb01r at ecs.soton.ac.uk
Wed May 17 13:10:04 BST 2006
Jim Lynch wrote:
> A general word of caution when using pdftotext to index things. If your
> PDF documents have multiple columns, the locality of terms may be
> incorrect. In my experience pdftotext paid no attention to columns, so
> the last word of the first line in the first column is followed by the
> first word of the first line in the second column. This will cause
> problems when searching for nearby words.
>
> For instance if we had a document that looked like
> Column 1                            Column 2
> this is a test of the near earth    discussions of this nature need to be
> monitoring system.                  continued as quickly as possible.
> ... ...
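To make the effect concrete, here is a rough Python sketch using the two columns from the example above. The helper names (words_row_by_row, words_column_by_column, distance) are made up for illustration and are not part of pdftotext or Xapian; the sketch only compares how far apart "earth" and "monitoring" end up in each reading order.

```python
# The two columns from the example, one entry per visual line on the page.
column1 = ["this is a test of the near earth", "monitoring system."]
column2 = ["discussions of this nature need to be",
           "continued as quickly as possible."]


def words_row_by_row(left, right):
    """Read straight across the page, ignoring columns (the problem case)."""
    words = []
    for left_line, right_line in zip(left, right):
        words += left_line.split() + right_line.split()
    return words


def words_column_by_column(left, right):
    """Read the whole left column, then the whole right column."""
    return " ".join(left + right).split()


def distance(words, a, b):
    """Number of word positions between the first occurrences of a and b."""
    return abs(words.index(a) - words.index(b))


row_order = words_row_by_row(column1, column2)
column_order = words_column_by_column(column1, column2)

print(distance(row_order, "earth", "monitoring"))     # 8
print(distance(column_order, "earth", "monitoring"))  # 1
```

In column order the two words are adjacent, so a phrase or NEAR query matches; read across the page, seven unrelated words sit between them.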
Version 3 is supposed to handle multiple columns correctly, but I agree it's flaky.
Try using the -raw option, which outputs the text in 'stream order'.
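If you are driving pdftotext from a script, something along these lines might do as a starting point. It is only a sketch: it assumes pdftotext is on your PATH, that your build accepts "-" as the output file to write to stdout, and that it supports "-enc UTF-8". The function names build_pdftotext_cmd and extract_text are my own, not part of any library.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, raw=True):
    """Build the pdftotext command line; "-" sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if raw:
        # -raw keeps the text in content-stream order instead of
        # letting pdftotext reconstruct the page layout.
        cmd.append("-raw")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, raw=True):
    """Run pdftotext and return the extracted text as a string.

    Raises subprocess.CalledProcessError if pdftotext exits with an error.
    """
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, raw),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("paper.pdf") on a sample file (the name is just a placeholder) and comparing the result against raw=False should show quickly whether stream order helps for your documents.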
Another possibility is 'PDFBox' (a Java API for PDF), which comes with a
text extraction tool:
http://www.pdfbox.org/
Tim.