[Xapian-discuss] omindex doesn't check last_mod
James Aylett
james-xapian at tartarus.org
Tue Aug 8 13:58:18 BST 2006
On Mon, Aug 07, 2006 at 09:10:48PM -0700, Michael Trinkala wrote:
> I recommend storing the last modified time and the document MD5 in
> the value table. I use both to determine if re-indexing is
> necessary. First comparing the last modified time and if necessary
> the MD5 (some files on our system get touched without having their
> content modified).
That's neat. I'd recommend only calculating the MD5 up to the first N
bytes of the file (where N is an appropriate number for your data and
hardware).
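
A rough sketch of that two-stage check in Python, assuming the stored values have already been read back from the document's value slots. The names partial_md5 and needs_reindex, the 64 KiB default for N, and the use of whole-second mtimes are all illustrative choices of mine, not anything omindex does:

```python
import hashlib
import os


def partial_md5(path, max_bytes=65536):
    """Return the hex MD5 of at most the first max_bytes bytes of a file."""
    digest = hashlib.md5()
    remaining = max_bytes
    with open(path, "rb") as handle:
        while remaining > 0:
            chunk = handle.read(min(8192, remaining))
            if not chunk:
                break  # file is shorter than max_bytes
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def needs_reindex(path, stored_mtime, stored_md5, max_bytes=65536):
    """Compare the cheap mtime first; hash only when the mtime differs."""
    if int(os.path.getmtime(path)) == stored_mtime:
        return False
    # The file was touched; re-index only if the hashed prefix changed.
    return partial_md5(path, max_bytes) != stored_md5
```

The trade-off is that a change entirely beyond the first N bytes goes unnoticed when only the prefix is hashed, so N wants picking with your data in mind.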
> For document lookup during indexing I use a unique key (MD5 of the
> full filename) stored in the term table (prefixed with F followed by
> the 16 byte binary MD5).
You can (and probably should) use Q for that, as it's a
document-unique identifying term. If it's a web-centric app, you're
better off using a URI if at all possible - cool URIs don't change,
whereas file paths do.
Of course, for a file-centric app such as desktop search, MD5 of the
filename is just as good (although you can convert to file: scheme
URIs).
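
For illustration, the two ways of building that unique term might look like this in Python. The helper names are made up for the example, and I use the hex digest rather than the 16 byte binary MD5 purely so the term is printable:

```python
import hashlib
from pathlib import Path


def unique_term_from_uri(uri):
    """Q-prefixed unique term built directly from a URI."""
    return "Q" + uri


def unique_term_from_path(path):
    """Q-prefixed unique term built from the MD5 of the full filename."""
    return "Q" + hashlib.md5(path.encode("utf-8")).hexdigest()


def path_to_file_uri(path):
    """Convert a filesystem path to an absolute file: URI."""
    return Path(path).resolve().as_uri()
```

With either form, the indexer looks the term up before adding a document and replaces the existing one if it is already there.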
> I will gladly contribute these changes and others if the team is
> interested. I will get a list up on xapian-devel to figure out what
> should/shouldn't be included.
>
> As for Excel support, check out xls2csv; catppt does a nice job
> with PowerPoint: http://www.45.free.net/~vitus/software/catdoc/
Cool :)
James
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james at tartarus.org uncertaintydivision.org