[Xapian-discuss] Newbie question: How to extract 'date modified' from path when indexing?

Olly Betts olly at survex.com
Wed Apr 1 08:52:49 BST 2009


On Tue, Mar 31, 2009 at 11:55:47PM -0400, Deron Meranda wrote:
> On Tue, Mar 31, 2009 at 8:36 PM, Bill Hutten <bill at hutten.org> wrote:
> > The files are stored in a consistent structure, for instance file
> > "foo.html" might be in "archives/2006/07/foo.html"  In this example, I
> > would like to be able to extract the 2006/07 value from the path during
> > indexing and use that as the date that Xapian/Omega uses to search on.
> 
> Do you have access to the webserver files at all?  Because the best
> solution is simply to change the timestamp of the underlying files.  That
> would benefit not only your Xapian indexing, but also all the other HTTP
> goodness; such as working with whatever other types of spiders or
> indexers may be crawling the site, HTTP proxies and caches, etc.

Yes, this is a very sensible approach.

> If it's Unix/Linux, changing the file timestamps would be quite easy.
> You want to look at the "time" command.  Or I could provide you
> a little script to do that.

Actually, "time" measures how long a command takes to run.  To change
file timestamps, use "touch" (its -t option sets a specific date).
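
For a whole tree it may be easier to script this than to run "touch" by
hand.  Here is a minimal sketch in Python, assuming the
"archives/YYYY/MM/" layout from the original question.  The function names
are hypothetical, and since the path only gives a year and month, the day
(the 1st) and the time (midnight UTC) are arbitrary choices:

```python
import os
import re
from datetime import datetime, timezone

# Assumed layout, taken from the question: .../archives/YYYY/MM/name
_DATE_IN_PATH = re.compile(r"(?:^|/)archives/(\d{4})/(\d{2})/")


def mtime_from_path(path):
    """Return a Unix timestamp for the 1st of the month named in path, or None."""
    m = _DATE_IN_PATH.search(path.replace(os.sep, "/"))
    if m is None:
        return None
    try:
        # The day and the UTC midnight time are assumptions; the path has neither.
        stamp = datetime(int(m.group(1)), int(m.group(2)), 1, tzinfo=timezone.utc)
    except ValueError:
        # Not a real date, e.g. month 13.
        return None
    return stamp.timestamp()


def retouch(root):
    """Set atime and mtime of every dated file under root; return the count changed."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            stamp = mtime_from_path(full)
            if stamp is not None:
                os.utime(full, (stamp, stamp))
                changed += 1
    return changed
```

Calling retouch(".") from the directory that contains "archives" would
update the files in place, so try it on a copy first.  Files whose path
doesn't match the pattern are left alone.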

> As a second choice, if say this is an Apache webserver and you
> can add some configuration (either the main config file or the
> per-directory .htaccess files); then you can force Apache to
> lie about the file's date.  This is easiest though if you only have a
> few directories (which if it's one directory per month is doable).
> Again, since the webserver would be sending out the correct
> date, it also benefits other spiders, indexers, HTTP caches, etc.

But omindex indexes from the filesystem, so the date Apache reports
doesn't matter to it.

> As a last resort, you're going to have to modify the indexer itself
> to overrule what it learns from the HTTP date, and instead extract
> a date out of the URL pattern.

If you take this approach, omindex.cc in the Omega source is the place
to make the change.
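
The real change would be C++ inside omindex.cc, which isn't shown here.
As a rough illustration of the logic only, this Python sketch derives
date terms from the path using the D (YYYYMMDD), M (YYYYMM) and Y (YYYY)
term prefixes that Omega documents.  The function name, the path pattern
and the choice of the 1st as the day are assumptions:

```python
import re

# Assumed layout, taken from the question: .../archives/YYYY/MM/name
_DATE_IN_PATH = re.compile(r"(?:^|/)archives/(\d{4})/(\d{2})/")


def date_terms(path):
    """Return Omega-style D/M/Y date terms for path, or [] if it has no date."""
    m = _DATE_IN_PATH.search(path)
    if m is None:
        return []
    year, month = m.group(1), m.group(2)
    if not 1 <= int(month) <= 12:
        return []
    # The path gives no day, so the 1st is used as a placeholder.
    return ["D" + year + month + "01", "M" + year + month, "Y" + year]
```

For "archives/2006/07/foo.html" this gives D20060701, M200607 and Y2006.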

Cheers,
    Olly


