[Xapian-discuss] scriptindex on an internet crawl
Olly Betts
olly at survex.com
Thu Jun 23 13:35:29 BST 2005
On Thu, Jun 23, 2005 at 08:14:05AM +0200, Arjen van der Meijden wrote:
> >>On Wed, Jun 22, 2005 at 03:21:32PM -0400, Georges Dupret wrote:
> >>
> >>>In a first try, I inserted in the command file url : field=url
> >>>boolean=XURL
> >>>unique=XURL and in the input file: url=www.dcc.uchile.cl/~gdupret for
> >>>example, but scriptindex start using 100% of the CPU and never finishes.
[...]
>
> Can't this be explained by just that scriptindex is very very slow?
In this particular case, I hope you mean...
> I can imagine that a unique-check for a relatively long identifier with a
> relatively similar beginning can be very time consuming and/or results
> in quite a bit of more btree-work. At least compared to more evenly
> distributed identifiers.
The term length shouldn't make too much difference, but you could be
right that it's just being slow. Checking a unique id does slow things
down (and there's scope for improvement there), and checking two for
each document could conceivably be worse than double the overhead of
checking one.
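As a rough illustration of where that overhead comes from, here is a toy Python model. It is only a sketch: a dict stands in for the term index, the ToyIndex class and its lookups counter are made up for this example, and none of it reflects how scriptindex or Xapian is actually implemented. The point is simply that each unique id costs at least one extra index lookup per document before anything is written.

```python
# Toy model only: a dict stands in for the term index. This is NOT how
# scriptindex or Xapian implements unique ids.

class ToyIndex:
    def __init__(self):
        self.by_term = {}    # unique term -> document id
        self.terms_of = {}   # document id -> unique terms it currently holds
        self.docs = {}       # document id -> fields
        self.lookups = 0     # number of unique-term lookups performed
        self.next_id = 1

    def add(self, fields, unique_prefixes):
        """Add a record, replacing any existing document with the same unique term.

        unique_prefixes is a list of (field_name, term_prefix) pairs,
        e.g. [("url", "XURL")].
        """
        terms = [prefix + fields[name] for name, prefix in unique_prefixes]

        doc_id = None
        for term in terms:
            self.lookups += 1              # one lookup per unique id checked
            found = self.by_term.get(term)
            if doc_id is None and found is not None:
                doc_id = found             # reuse the existing document

        if doc_id is None:
            doc_id = self.next_id
            self.next_id += 1
        else:
            # Drop the replaced document's old unique terms.
            for old in self.terms_of.get(doc_id, []):
                self.by_term.pop(old, None)

        self.docs[doc_id] = fields
        self.terms_of[doc_id] = terms
        for term in terms:
            self.by_term[term] = doc_id
        return doc_id
```

Adding the same url twice with [("url", "XURL")] leaves one document and performs two lookups; checking two unique ids per record doubles the lookup count in this model, and a real B-tree may do worse than that.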
Georges: try adding "-v" to the scriptindex command line for verbose
output. That will make it print a message each time it adds a document
so we'll see if it's actually making slow progress.
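If the verbose output scrolls past too quickly (or too slowly) to judge by eye, one option is to timestamp each line. The following is a minimal sketch; the timed_lines helper is a name invented for this example, and the scriptindex arguments in the comment are placeholders for your own database, index script and input file.

```python
import subprocess
import sys
import time


def timed_lines(cmd):
    """Run cmd and yield (seconds_since_start, line) for each line it prints."""
    start = time.monotonic()
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True) as proc:
        for line in proc.stdout:
            yield time.monotonic() - start, line.rstrip("\n")


def report(cmd):
    """Print each output line of cmd prefixed with the elapsed time."""
    for elapsed, line in timed_lines(cmd):
        print(f"{elapsed:8.2f}s  {line}")


# Example call (placeholder arguments, substitute your own paths):
#   report(["scriptindex", "-v", "DATABASE", "INDEXER_SCRIPT", "INPUT_FILE"])
```

Note that a program writing to a pipe may buffer its output, so the timestamps can arrive in bursts; steadily growing gaps between lines would still suggest that each document is taking longer than the last.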
Cheers,
Olly