[Xapian-discuss] docx support
olly at survex.com
Thu Jul 24 15:21:46 BST 2008
On Thu, Jul 24, 2008 at 09:02:48PM +0930, Frank Bruzzaniti wrote:
> One question I have re omindex, when I run a crawl I see:
> Indexing "/New Spreadsheet.ots" as
> application/vnd.oasis.opendocument.spreadsheet-template ... updated.
> I assume omindex uses OpenOffice to do the conversion.
No, it just pulls out the XML from the zip wrapper and parses it. Since
formatting isn't important, this isn't hard to do.
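
As a rough illustration of the approach (a minimal Python sketch, not
omindex's actual code): open the zip container, read one XML member, and
keep only the character data. The function name extract_text is
hypothetical. The default member name "content.xml" is where OpenDocument
files keep the document body; for .docx the body is in
"word/document.xml".

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_text(path_or_file, member="content.xml"):
    """Return the plain text held in one XML member of a zip container.

    Hypothetical helper for illustration only; omindex itself does not
    use this code.
    """
    with zipfile.ZipFile(path_or_file) as zf:
        xml_bytes = zf.read(member)
    root = ET.fromstring(xml_bytes)
    # itertext() yields every text node in document order, so all
    # formatting markup is simply ignored.
    chunks = (t.strip() for t in root.itertext())
    return " ".join(c for c in chunks if c)
```

Calling extract_text("report.odt") would return the body text of an
OpenDocument file, and extract_text("report.docx", "word/document.xml")
would do the same for a .docx file (both file names are made up here).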
> I can open *.docx with OpenOffice and save as a *.txt. How come you don't
> use OpenOffice for the bulk of your conversions?
OpenOffice is rather heavyweight compared to running unzip and then a
simple XML parser, both in terms of a dependency to have to install and
memory used while indexing. I bet nearly every Linux server already has
unzip, but I suspect most don't have OpenOffice. Certainly the one
I'm writing this mail on has unzip but not OpenOffice.