[Xapian-discuss] scriptindex on an internet crawl

Georges Dupret gdupret at dcc.uchile.cl
Wed Jun 22 20:21:32 BST 2005


Hi!

I have a crawl of the Chilean internet on disk as raw text. Each
document has a field for its URL, one for its title and one for its
content.

As a first try, I used scriptindex with

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
uid : field=uid boolean=XUID unique=XUID
title : field=title index
value : index
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

in the command file and, as the input file,

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
uid=1
title=us spy plane crashes in sw asia
value=a us air force u-2 spy plane has crashed in south-west asia killing
=the pilot, the us military has said.  the crash occurred at 2330 gmt
=on tuesday, when the pilot was returning to base after completing a
etc.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

and it works fine for probabilistic searches.
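
For reference, the input layout above (one "name=value" line per field,
continuation lines starting with "=", records separated by a blank line)
can be produced from a crawl with a small script. What follows is only a
minimal sketch in Python: the helper names format_record and format_dump
and the 70-column wrap width are assumptions made for illustration, not
part of scriptindex or of the setup described here.

```python
import textwrap


def format_record(fields, width=70):
    """Render one record (a dict of field name -> text) as input lines.

    The first line of each field is 'name=value'; any wrapped
    continuation line starts with '='. Wrapping never splits a word,
    so a long URL stays on a single line.
    """
    lines = []
    for name, value in fields.items():
        wrapped = textwrap.wrap(
            value,
            width=width,
            break_long_words=False,
            break_on_hyphens=False,
        ) or [""]  # an empty value still yields a 'name=' line
        lines.append(f"{name}={wrapped[0]}")
        lines.extend(f"={part}" for part in wrapped[1:])
    return "\n".join(lines)


def format_dump(records, width=70):
    """Join several records, separating them with a blank line."""
    return "\n\n".join(format_record(r, width) for r in records) + "\n"


if __name__ == "__main__":
    record = {
        "uid": "1",
        "title": "us spy plane crashes in sw asia",
        "value": "a us air force u-2 spy plane has crashed in "
                 "south-west asia killing the pilot, the us military "
                 "has said.",
    }
    print(format_dump([record]), end="")
```

The output can be redirected to a file and passed to scriptindex as the
input file.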

My questions are:
1) what should I input in the search field to search only the title? I
was expecting that something like "title: plane" would work, but it
doesn't.
2) how can I show the original URL of each retrieved document, so that
clicking on the hyperlink takes me to the original document (i.e. not
the copy I have in my crawl)? As a first try, I added
"url : field=url boolean=XURL unique=XURL" to the command file and, for
example, "url=www.dcc.uchile.cl/~gdupret" to the input file, but
scriptindex then starts using 100% of the CPU and never finishes.

Thank you in advance for your help

Georges




