[Xapian-discuss] quick-and-dirty web search for a bunch of PDFs?

Wed May 17 12:30:42 BST 2006

A general word of caution when using pdftotext to index things.  If your 
PDF documents have multiple columns, the locality of terms may be 
incorrect.  In my experience pdftotext paid no attention to columns, so 
the last word of the first line in the first column is followed by the 
first word of the first line in the second column.  This will cause 
problems when searching for nearby words.

For instance, if we had a document that looked like
  Column 1                            Column 2
  this is a test of the near earth    discussions of this nature need to be
  monitoring system.                  continued as quickly as possible.
  ...                                 ...
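
To make the effect concrete, here is a small sketch that simulates the two 
reading orders on the sample text above.  It is only an illustration: the 
helper names (tokens, row_order, column_order, has_phrase) are made up for 
this example and are not part of pdftotext, Xapian or Omega.

```python
# The two-column page from the example, one (left, right) pair per line.
rows = [
    ("this is a test of the near earth", "discussions of this nature need to be"),
    ("monitoring system.", "continued as quickly as possible."),
]


def tokens(text):
    """Split text into lowercase words, dropping trailing punctuation."""
    return [word.strip(".,").lower() for word in text.split()]


def row_order(rows):
    """Read straight across each line, ignoring the column boundary."""
    words = []
    for left, right in rows:
        words += tokens(left) + tokens(right)
    return words


def column_order(rows):
    """Read all of column 1 first, then all of column 2."""
    words = []
    for left, _ in rows:
        words += tokens(left)
    for _, right in rows:
        words += tokens(right)
    return words


def has_phrase(words, phrase):
    """Return True if the words of `phrase` appear consecutively in `words`."""
    target = tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


print(has_phrase(row_order(rows), "earth monitoring"))
print(has_phrase(column_order(rows), "earth monitoring"))
```

Only the column-by-column reading keeps "earth" and "monitoring" adjacent, 
which is what a phrase or proximity search relies on.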

A search for "earth monitoring" would fail because "earth" is followed 
by "discussions".  The only way I found to avoid this was to convert 
the document from PDF into PostScript and then from PostScript into 
text.  Apparently pdf2ps knows how to handle multiple columns. 
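
A rough sketch of that two-step conversion follows.  The post above does 
not say which tool did the PostScript-to-text step, so ps2ascii (shipped 
with many Ghostscript installations, though possibly not recent ones) is 
only an assumption here, and the function names are made up for the 
example.  Whether the result really comes out in column order will depend 
on the documents and tool versions involved.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, ps_to_text="ps2ascii"):
    """Return the two command lines: PDF -> PostScript, then PostScript -> text.

    `ps_to_text` is an assumption; swap in whichever PostScript-to-text
    tool is actually installed.
    """
    pdf = Path(pdf_path)
    ps = pdf.with_suffix(".ps")
    txt = pdf.with_suffix(".txt")
    return [
        ["pdf2ps", str(pdf), str(ps)],    # Ghostscript wrapper: input.pdf output.ps
        [ps_to_text, str(ps), str(txt)],  # assumed usage: input.ps output.txt
    ]


def pdf_to_text_via_postscript(pdf_path, ps_to_text="ps2ascii"):
    """Run both steps and return the path of the resulting text file."""
    for argv in build_commands(pdf_path, ps_to_text):
        # check=True raises CalledProcessError if a step fails.
        subprocess.run(argv, check=True)
    return Path(pdf_path).with_suffix(".txt")
```

Calling pdf_to_text_via_postscript("report.pdf") would leave report.ps and 
report.txt next to the original, and the text file could then be handed to 
the indexer in place of pdftotext's output.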

The man page for pdftotext implies that it will "'undo' physical layout 
(columns, hyphenation, etc.) and output the text in reading order." but 
that was not my experience, at least not for the PDF files I was indexing.

Jim.

Olly Betts wrote:

>Omega's "omindex" indexer will index PDF files out of the box (just make
>sure you have pdftotext installed.)