[Xapian-discuss] Xapian with djvu files?

Olly Betts olly at survex.com
Mon Jan 14 08:25:41 GMT 2008


On Mon, Jan 14, 2008 at 07:25:45AM +0000, James Aylett wrote:
> On Mon, Jan 14, 2008 at 05:33:38PM +1100, John Pye wrote:
> 
> > I was wondering if there was any support in Xapian for DJVU files. These
> > are a nice alternative to PDF files -- much smaller file size, typically.

I did actually write a patch for djvu a while back.  I think I didn't
apply it because I could only find a single example file with a text
layer, and that had just 20 words of ASCII text.  I like to have a
few decent test files (including some with non-ASCII characters) to give
me some confidence that a filter program actually works well.  It
doesn't seem to be a popular format (John is the first person to ask
about support for it), so I left the matter there.

> There isn't at the moment, but it would be fairly easy to add support
> into omindex(1) to use djvutxt to convert for indexing. djvutxt uses
> UTF-8 already, so something like the following in
> omindex.cc:index_file() around line 308 *should* do the trick
> (untested!):
[...]

Yes, that looks about right, except the mime-type I have listed in
/etc/mime.types is "image/vnd.djvu", which is the one registered with
IANA:

http://www.iana.org/assignments/media-types/image/vnd-djvu

> However I have to wonder why you want to - djvu is primarily an image
> file format, although it has support for mixed text and images. I
> admit I hadn't heard of it before now though, so perhaps the website
> [1] is a little misleading about the primary use.

"man djvutxt" is a little more helpful:

    Program djvutxt decodes the hidden text layer of a DjVu document
    inputdjvufile and prints it into file outputtxtfile or on the
    standard output.  The hidden text layer is usually generated with
    the help of an optical character recognition software.
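
A minimal sketch of that extraction step in Python (this is not the
omindex patch, which is C++ and elided above). It assumes djvutxt is on
the PATH and emits UTF-8, as James says. The names djvutxt_command and
extract_djvu_text are hypothetical, as is the injectable "run"
parameter, which is only there so the function can be exercised without
djvutxt installed.

```python
import subprocess

# MIME type registered with IANA for DjVu, as listed in /etc/mime.types.
DJVU_MIME_TYPE = "image/vnd.djvu"


def djvutxt_command(djvu_path):
    """Build the argument list for djvutxt (hypothetical helper).

    With no output file given, djvutxt prints the hidden text layer
    on standard output.
    """
    return ["djvutxt", djvu_path]


def extract_djvu_text(djvu_path, run=subprocess.run):
    """Return the hidden text layer of a DjVu file as a string.

    Assumes the output is UTF-8; undecodable bytes are replaced
    rather than raising, so one bad file does not stop an indexing run.
    A non-zero exit status from djvutxt raises CalledProcessError.
    """
    result = run(djvutxt_command(djvu_path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_djvu_text("book.djvu") on a real file returns whatever
text layer djvutxt finds, which could then be handed to an indexer.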

Cheers,
    Olly


