[Xapian-discuss] tiff / image pdf filter
Frank J Bruzzaniti
frank.bruzzaniti at gmail.com
Thu Mar 19 05:06:47 GMT 2009
I've been experimenting with using tesseract to OCR TIFFs with omega, just
using the tesseract binary package from Ubuntu.
The one issue I find is that tesseract is very slow.
One workaround, so that OCRing doesn't hold up omindex, would be to maintain
a separate instance of omindex and a separate database of OCR'd data,
then allow them both to be searched via the "stub database" method. I'd
definitely want to use the last_mod patch here so I don't have to re-OCR.
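As a rough sketch of the stub database idea above: a Xapian stub database is a plain text file listing one database per line, in the form "type path", and searching the stub searches all of the listed databases together. The helper name and the two paths below are hypothetical, and "auto" is used on the assumption that Xapian should detect each backend itself.

```python
def stub_contents(db_paths):
    """Build the text of a Xapian stub database file.

    Each line names one database as "auto <path>", so that a search
    against the stub covers every listed database.
    """
    return "".join(f"auto {path}\n" for path in db_paths)


if __name__ == "__main__":
    # Hypothetical locations: the regular omindex database and the
    # separate, slowly built OCR database.
    print(stub_contents(["/var/lib/omega/data/main",
                         "/var/lib/omega/data/ocr"]), end="")
```

The output would be saved as a file where omega looks for its databases, so the slow OCR run can update its own database without holding up the main index.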
Does this sound reasonable? If anyone has any better solutions I'd love
to hear of them. Once I've got it sorted I'll submit a patch. Maybe we
could have a flag for omindex so it knows if it's designated just to OCR
TIFFs.
I guess we could also OCR image PDFs if they come back with no data from
the regular PDF filter. E.g. if you run omindex --tiff --ipdf then it
will only OCR TIFFs and image PDFs. For PDFs it would run the regular PDF
filter first: if it returns data then skip the file, if it doesn't then OCR it.
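The skip-or-OCR decision described above could look roughly like the sketch below. It is not omindex code: the function names are made up for illustration, it assumes pdftotext is the regular PDF filter, and it assumes tesseract is called as "tesseract input outbase", which writes outbase.txt. An image PDF would still need converting to TIFF before tesseract can read it; that step is left out here.

```python
import os
import subprocess
import tempfile


def pdf_filter_text(path):
    """Run the regular text filter; pdftotext writes to stdout when given '-'."""
    result = subprocess.run(["pdftotext", path, "-"],
                            capture_output=True, text=True)
    return result.stdout


def ocr_tiff(path):
    """OCR one TIFF with tesseract and return the recognised text."""
    with tempfile.TemporaryDirectory() as tmp:
        outbase = os.path.join(tmp, "ocr")
        # tesseract appends ".txt" to the output base name itself.
        subprocess.run(["tesseract", path, outbase], check=True)
        with open(outbase + ".txt", encoding="utf-8", errors="replace") as f:
            return f.read()


def choose_action(path, text_filter=pdf_filter_text):
    """Return 'skip' if the regular filter found text, otherwise 'ocr'.

    Output that is only whitespace (including the form feeds pdftotext
    emits between pages) counts as no data.
    """
    text = text_filter(path)
    return "skip" if text.strip() else "ocr"
```

Passing the filter in as an argument keeps the decision separate from the external tools, so the logic can be checked without pdftotext or tesseract installed.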
Frank