[Xapian-discuss] Adding a Web Spider

rm at fabula.de
Fri Jul 2 11:38:33 BST 2004


On Fri, Jul 02, 2004 at 01:45:29AM -0700, Lee Johnson wrote:
> Hi,
> I read the "future of Xapian" thread today. One item in
> that thread is particularly interesting to me: adding a
> web spider. We all know that Xapian is not designed
> exclusively for that purpose, but a web spider could
> greatly increase Xapian's usage. I'm not a programmer,
> but writing a web spider seems rather simple compared to
> writing Xapian itself.

I happen to call myself a programmer and have written some
crawlers myself. No, writing a _good_ crawler is _far_ from
simple: you need a rather error-tolerant HTML/XHTML parser,
a good HTTP library, smart tracking of ETag headers and
content hash sums, and, increasingly, a rather capable
ECMAScript interpreter (for those stupid JavaScript links ...).
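
To illustrate the ETag/content-hash point, here is a minimal
sketch in Perl (assuming a recent LWP and Digest::MD5; the
function name fetch_if_changed and the caches are mine, not
from any library):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Digest::MD5 qw(md5_hex);

    my $ua = LWP::UserAgent->new;
    my (%etag, %seen_hash);   # per-URL ETag cache, content-hash dedup

    sub fetch_if_changed {
        my ($url) = @_;
        # Send If-None-Match when we have a cached ETag for this URL.
        my @cond = $etag{$url} ? ('If-None-Match' => $etag{$url}) : ();
        my $resp = $ua->get($url, @cond);
        return if $resp->code == 304 || !$resp->is_success;  # unchanged/failed

        $etag{$url} = $resp->header('ETag') if $resp->header('ETag');
        # Skip bodies we have already seen under another URL.
        my $hash = md5_hex($resp->decoded_content(charset => 'none'));
        return if $seen_hash{$hash}++;
        return $resp->decoded_content;
    }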

I agree, it's pretty trivial to hack a _bad_ crawler in
one of the P languages, but there are already quite a few
out there in the wild you can catch and abuse :-)

> In turn,
> Xapian could gain lots of users who become familiar with
> it, use it in other areas, tell others about it, and so on.

??? Is this really how it works? People use Xapian because
they need powerful IR technology. Word of mouth is a bad
advisor in such areas. I doubt that mifluz is used more
often because of its use in htdig -- actually, I'm still
not sure anyone is using it outside htdig.

> I'm saying this because I also need a crawler for
> Xapian. I have hand-picked a rather big list of URLs
> (just the URLs, not the contents) and need a crawler to
> crawl all pages beneath those URLs and put their
> contents into a DB, so I can use Xapian to index and
> search that DB. I'm very open to suggestions.

Use Perl with the LWP library to fetch the documents,
parse them with the Perl libxml2 binding (which has a pretty
good HTML mode), use libxml2's Reader API to extract all
URLs and push them onto a stack of jobs. Use Xapian's
Perl bindings to do the actual indexing. Nothing too
hard. But: if the resources you grab aren't on your servers,
you might want to honor robots.txt and add delays to the
job queue, check for dynamic content, etc.
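
A minimal sketch of that loop, under some assumptions: recent
LWP::RobotUA (which handles robots.txt and per-host delays),
XML::LibXML and Search::Xapian; findnodes() instead of the
Reader API for brevity; a crude \w+ tokenizer standing in for
real term generation; and the seed URL and database path are
placeholders you'd replace:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::RobotUA;         # honors robots.txt and rate-limits for us
    use XML::LibXML;
    use URI;
    use Search::Xapian qw(:db);

    # Hypothetical seed list and database path -- substitute your own.
    my @roots = ('http://www.example.org/');
    my @queue = @roots;
    my %seen  = map { $_ => 1 } @queue;

    my $ua = LWP::RobotUA->new('xapian-crawler/0.1', 'you@example.org');
    $ua->delay(1);            # at most one request per minute per host

    my $db = Search::Xapian::WritableDatabase->new(
        '/tmp/crawl.db', DB_CREATE_OR_OPEN);
    my $parser = XML::LibXML->new(recover => 2);

    while (my $url = shift @queue) {
        my $resp = $ua->get($url);
        next unless $resp->is_success
            && $resp->content_type =~ m{^(text/html|application/xhtml)};

        my $dom = eval { $parser->parse_html_string($resp->decoded_content) }
            or next;
        my $text = $dom->textContent;

        # Index the plain text; positional postings keep phrase search working.
        my $doc = Search::Xapian::Document->new;
        $doc->set_data($url);
        my $pos = 0;
        $doc->add_posting(lc $_, ++$pos) for $text =~ /(\w+)/g;
        $db->add_document($doc);

        # Extract links and queue the unseen ones.
        for my $node ($dom->findnodes('//a/@href')) {
            my $link = URI->new_abs($node->value, $url)->canonical;
            $link->fragment(undef);
            # Only follow pages beneath one of the seed URLs.
            next unless grep { index("$link", $_) == 0 } @roots;
            push @queue, "$link" unless $seen{"$link"}++;
        }
    }

For anything serious you'd also want the ETag/hash checks from
above, persistence for the queue, and per-host politeness beyond
what LWP::RobotUA gives you, but the skeleton really is this small.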

> I looked
> at Nutch, Heritrix and Larbin (this last one probably just
> fetches the URLs, not the contents; I asked the developer
> about this but have had no answer yet), but with those I
> cannot use Xapian (if I use one of them then I will
> probably use mnoGoSearch). Another thing with Nutch and
> Heritrix is that they are written in Java, which, IMHO, is
> not a good idea.

If it does the job ...

> Also, for those interested, a good read may be
> http://acmqueue.com/modules.php?name=Content&pa=list_pages_issues&issue_id=12
> which devoted that month's issue to the topic of search.


> 
> Regards
> 

Just my $0.02

  Ralf Mattes


