[Xapian-discuss] docx support

Thu Jul 24 12:51:05 BST 2008

Hi Frank

Xapian is an excellent tool and will do what you want very well, but
it is a tool and not a "shrink-wrapped" product. It requires a lot of
technical knowledge to implement and to use. Often developers will
take Xapian, customise it, and create a user-friendly front end for
it. Omindex / Omega will do the job you're after but needs
customisation to suit your requirements. There are people on this list
who are available as paid consultants to help you if you don't have the
technical background to implement Xapian. I'm sure they will make
themselves available to you if you ask.

If you do want to get your hands dirty, then I'm sure everyone on this  
list will chip in to help you reach your goal.

Personally I use it to index everything from photos (using EXIF data)
to PDF, Word, HTML etc. As long as you're able to extract raw text from
something, you can put it in Xapian.
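
For docx, "extracting raw text" could look roughly like the sketch below. It
assumes you already have the contents of word/document.xml in a string, that
the XML is well formed, and that entity decoding can be skipped.
extract_docx_text is a made-up name for illustration, not something from
Xapian or Omega, and it is not the code from my own indexer.

```cpp
#include <string>

// Hypothetical helper (not part of Xapian or Omega): collect the text of
// every <w:t> element in a WordprocessingML document.xml string and end
// each paragraph (</w:p>) with a newline.
// Assumptions: well-formed XML, no '>' inside attribute values, and
// entities such as &amp; are left undecoded.
std::string extract_docx_text(const std::string& xml) {
    std::string text;
    std::string::size_type pos = 0;
    while (pos < xml.size()) {
        std::string::size_type lt = xml.find('<', pos);
        if (lt == std::string::npos) break;
        std::string::size_type gt = xml.find('>', lt);
        if (gt == std::string::npos) break;
        // Everything between '<' and '>' is the tag name plus attributes.
        std::string tag = xml.substr(lt + 1, gt - lt - 1);
        pos = gt + 1;
        if (tag == "/w:p") {
            text += '\n';  // paragraph boundary
        } else if ((tag == "w:t" || tag.compare(0, 4, "w:t ") == 0) &&
                   tag[tag.size() - 1] != '/') {
            // A text run: copy everything up to the closing tag.
            std::string::size_type end = xml.find("</w:t>", pos);
            if (end == std::string::npos) break;
            text.append(xml, pos, end - pos);
            pos = end + 6;  // skip past "</w:t>"
        }
    }
    return text;
}
```

The string it returns is plain text, which is the kind of input an indexer
can hand on to Xapian. A real XML parser is the safer choice for anything
beyond a quick test.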

Regards

Colin

On 24 Jul 2008, at 12:37, Frank Bruzzaniti wrote:

> I think you should, then numbnuts like me could use it.
>
> You write your own indexer, wow.
>
> I was looking for an indexer that could index all my documents and
> then give a simple "google"-like webpage that I could customize.
>
> I wanted to be able to process searchable PDFs and office
> documents. Do you think Xapian is the right project for me?
>
> Colin Bell wrote:
>>
>> Hi Frank
>>
>> You will have to get your hands dirty I'm afraid.
>>
>> I use my own indexer (which is very customised) and not Omega.  
>> Essentially you would have to integrate the example code I gave you  
>> into the Omega source and compile it. Otherwise you could use the  
>> code in your own indexer.
>>
>> I'm not sure if the Xapian mega coders responsible for Omega might  
>> find it worthy of official inclusion?
>>
>> On 24 Jul 2008, at 12:19, Frank Bruzzaniti wrote:
>>
>>> I have just set up my first test using omega + xapian, how would I
>>> integrate what you have provided below?
>>>
>>> Colin Bell wrote:
>>>>
>>>> This is how I do it using the TinyXML parser. My XML parsing may be a
>>>> bit convoluted but it works. The same approach can be applied to
>>>> PowerPoint and Excel too.
>>>>
>>>> ...
>>>>  mime_map["docx"] = "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
>>>>  mime_map["pptx"] = "application/vnd.openxmlformats-officedocument.presentationml.presentation";
>>>>  mime_map["xlsx"] = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
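
A map like mime_map above is keyed by file extension, so the lookup might be
sketched as follows. mime_type_for is a hypothetical helper for illustration,
and lowercasing the extension is an assumption of this sketch. The way
omindex itself resolves MIME types may well differ.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper: return the MIME type registered for a filename's
// extension, or an empty string if there is no extension or no entry.
// The extension is lowercased so "Report.DOCX" matches the "docx" key.
std::string mime_type_for(const std::map<std::string, std::string>& mime_map,
                          const std::string& filename) {
    std::string::size_type dot = filename.rfind('.');
    if (dot == std::string::npos) return std::string();
    std::string ext = filename.substr(dot + 1);
    for (std::string::size_type i = 0; i < ext.size(); ++i) {
        ext[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(ext[i])));
    }
    std::map<std::string, std::string>::const_iterator it = mime_map.find(ext);
    return it == mime_map.end() ? std::string() : it->second;
}
```

An empty result would be the signal to skip the file or fall back to some
other kind of detection.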
>>>>
>>>> ...
>>>>
>>>> //HANDLE DOCX WORD DOCUMENTS
>>>>  if (mimetype == "application/vnd.openxmlformats-officedocument.wordprocessingml.document") {
>>>>    // Pull the metadata parts out of the zip container.
>>>>    string cmd = "unzip -p " + shell_protect(filepath) + " docProps/core.xml";
>>>>    fileData += parseWordXMetaData(mstdout_to_string(cmd));
>>>>    cmd = "unzip -p " + shell_protect(filepath) + " docProps/app.xml";
>>>>    fileData += parseWordXMetaData(mstdout_to_string(cmd));
>>>>    cmd = "unzip -p " + shell_protect(filepath) + " docProps/custom.xml";
>>>>    fileData += parseWordXCustomMetaData(mstdout_to_string(cmd));
>>>>    // The body text lives in word/document.xml.
>>>>    cmd = "unzip -p " + shell_protect(filepath) + " word/document.xml";
>>>>    try {
>>>>      XmlParser xmlparser;
>>>>      xmlparser.parse_html(mstdout_to_string(cmd));
>>>>      dump = xmlparser.dump;
>>>>    } catch (ReadError) {
>>>>      cout << "\"" << cmd << "\" failed - skipping\n";
>>>>      return 0;
>>>>    }
>>>>  }
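
The snippet above builds its unzip commands by hand and relies on
shell_protect to make the path safe for the shell. If you are writing your
own indexer and don't have that function, a stand-in might look like the
sketch below. quote_arg and unzip_member_cmd are hypothetical names, and
this is not Omega's shell_protect, which may behave differently. It assumes
a POSIX shell.

```cpp
#include <string>

// Hypothetical stand-in for a shell-quoting helper (NOT Omega's
// shell_protect): wrap the argument in single quotes and rewrite each
// embedded single quote as '\'' so a POSIX shell sees one literal word.
std::string quote_arg(const std::string& arg) {
    std::string out = "'";
    for (std::string::size_type i = 0; i < arg.size(); ++i) {
        if (arg[i] == '\'') {
            out += "'\\''";  // close quote, escaped quote, reopen quote
        } else {
            out += arg[i];
        }
    }
    out += "'";
    return out;
}

// Build the command that prints one member of a zip archive to stdout.
std::string unzip_member_cmd(const std::string& archive,
                             const std::string& member) {
    return "unzip -p " + quote_arg(archive) + " " + quote_arg(member);
}
```

For example, unzip_member_cmd("my file.docx", "word/document.xml") gives
unzip -p 'my file.docx' 'word/document.xml'. Quoting does not stop unzip
from reading a filename that starts with "-" as an option, so that case
would need separate handling.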
>>>>
>>>> // Parse docProps/custom.xml and return one line per custom property
>>>> // value, in the form name:<property name>="<value>".
>>>> string parseWordXCustomMetaData(const string& xml){
>>>>   string fileData;
>>>>   TiXmlDocument doc;
>>>>   doc.Parse(xml.c_str());
>>>>   TiXmlElement* root = doc.RootElement();
>>>>   if (!root) return fileData;
>>>>   // Walk the element children of the root (the property elements).
>>>>   for (TiXmlElement* prop = root->FirstChildElement(); prop != 0;
>>>>        prop = prop->NextSiblingElement()) {
>>>>     // Attribute() returns NULL when the attribute is missing.
>>>>     const char* attr = prop->Attribute("name");
>>>>     string name = attr ? attr : "";
>>>>     // The value is the text of the property's child element(s).
>>>>     for (TiXmlElement* val = prop->FirstChildElement(); val != 0;
>>>>          val = val->NextSiblingElement()) {
>>>>       if (val->GetText()) {
>>>>         fileData += "name:" + name + "=\"" + val->GetText() + "\"\n";
>>>>       }
>>>>     }
>>>>   }
>>>>   return fileData;
>>>> }
>>>>
>>>> Easy peasy ;-)
>>>>
>>>> On 23 Jul 2008, at 19:38, Frank Bruzzaniti wrote:
>>>>
>>>>> Are Office 2007 formats like docx supported?
>>>>>
>>>>> Is there any way to get Xapian to index Office 2007 formats?
>>>>>
>>>>> Is there any option/procedure to add a new MIME plugin?
>>>>> For example, if you rename a docx to .zip you can retrieve text from
>>>>> document.xml
>>>>>
>>>>> Thanks
>>>>>
>>>>> Frank
>>>>>
>>>>
>>