[Xapian-discuss] Adding a Web Spider

James Aylett james-xapian at tartarus.org
Fri Jul 2 12:17:52 BST 2004

On Fri, Jul 02, 2004 at 12:38:33PM +0200, rm at fabula.de wrote:

> I happen to call myself a programmer and have written some
> crawlers myself. No, writing a _good_ crawler is _far_ from
> simple (you need a rather error-tolerant HTML/XHTML parser,
> a good http lib, smart tracking of ETag headers and content
> hash sums, and increasingly a rather capable ECMAScript
> interpreter for those stupid JavaScript links ...).

I'd echo that; I haven't written a crawler for indexing, but I've
written similar systems at work, and they tend to be fairly painful.
However, if we were to come up with some sort of modular design for a
spider/indexer pair, and implement it well, it might indeed
help. But I do wonder how many people actually need something like
that? Surely most potential uses of an IR system will be working with
local data? (Larger institutions need spiders, so I can see the appeal
for consultancy companies, and I'll certainly support and offer
suggestions if anyone is going to write one. Just don't think it's
going to be easy :-)
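
To make the ETag point above concrete: the trick is conditional GETs,
so the spider only re-downloads (and re-indexes) pages that actually
changed. A rough Python sketch, standard library only; the etags dict
stands in for whatever persistent store a real spider would use:

# Conditional fetch using If-None-Match: a 304 reply means the page
# is unchanged since we last saw this ETag, so there is nothing to
# re-index. The etags dict is in-memory only for this sketch.
import urllib.error
import urllib.request

etags = {}

def fetch_if_changed(url):
    req = urllib.request.Request(url)
    if url in etags:
        req.add_header("If-None-Match", etags[url])
    try:
        resp = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None          # not modified, skip re-indexing
        raise
    etag = resp.headers.get("ETag")
    if etag:
        etags[url] = etag
    return resp.read()

Content hash sums cover the servers that don't send ETags: hash the
body and compare it with what you stored on the last crawl.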

> Use Perl with the LWP lib to fetch the documents,
> parse them with the Perl libxml2 parser (that has a pretty
> good html mode), use libxml2's Reader API to fetch all
> URLs and push them onto a stack of jobs. Use Xapian's
> Perl bindings to do the actual indexing. Nothing too
> hard. But: if the resources you grab aren't on your servers
> you might want to honor robots.txt and add delays to the
> job queue, check for dynamic content, etc.
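
That's roughly the shape I'd use too. Here's a sketch of that
fetch/parse/enqueue/index loop, in Python rather than Perl so all the
snippets in this mail stay in one language, using just the standard
library plus the Xapian Python bindings. The names (crawl,
LinkAndTextParser) are made up for illustration, and it deliberately
skips the ETag checks, politeness delays and robots.txt handling:

# A sketch of the loop described above, not a real crawler: no ETag
# or robots.txt handling, no politeness delay, and html.parser is far
# less error-tolerant than libxml2's HTML mode.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

import xapian  # the Xapian Python bindings


class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible text from a page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text.append(data)


def crawl(start_url, db_path, limit=50):
    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)
    termgen = xapian.TermGenerator()
    termgen.set_stemmer(xapian.Stem("english"))
    jobs = [start_url]           # the "stack of jobs" from the quote
    seen = set()
    while jobs and len(seen) < limit:
        url = jobs.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue             # skip pages we can't fetch
        parser = LinkAndTextParser()
        parser.feed(page)
        # Index the page text, storing the URL as the document data.
        doc = xapian.Document()
        doc.set_data(url)
        termgen.set_document(doc)
        termgen.index_text(" ".join(parser.text))
        db.add_document(doc)
        # Push outgoing links (resolved against the page URL) onto the stack.
        jobs.extend(urljoin(url, link) for link in parser.links)

A real one would also want content-type filtering before feeding
pages to the parser, plus the ETag/hash checks from earlier.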

If you use Python, there's a robots.txt implementation in the
standard library, although IIRC it's buggy :-(
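
For the record, it's the robotparser module (urllib.robotparser in
today's Python); usage is short enough to show, with example.org as a
placeholder:

# Minimal use of the stdlib robots.txt parser; given the bugs, it is
# worth testing it against your own robots.txt before trusting it.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.org/robots.txt")
rp.read()  # fetches and parses the file
if rp.can_fetch("MySpider/0.1", "http://example.org/some/page"):
    print("allowed to fetch")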


  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org
