[Xapian-discuss] Help on indexscript

Olly Betts olly at survex.com
Thu Apr 27 13:02:49 BST 2006


On Thu, Apr 27, 2006 at 12:58:27PM +0200, Hermann Rokicz wrote:
> I'm trying to index a set of usenet-postings. The script I currently use:
> 
> mid    : unique=Q boolean=Q field=mid

To protect yourself from very long message-ids, you ought to either
use "truncate=240" (or something shorter) or use the "hash" command (new
in 0.9.5) which hashes the end of very long boolean terms:

mid : hash unique=Q boolean=Q field

> title  : truncate=100 weight=3 index=S field=title

I'd probably index the whole title, and just truncate what is stored in
the field (remember that the commands are interpreted from left to right
as written).  But I can see arguments for restricting how much of a long
title is indexed.

> email  : truncate=100 lower boolean=X field=email

If you're planning to search with Omega, using "X" alone as a prefix is
probably unwise.  Either use "X<something>", or just use "A" (which is
usually used to mean "Author").

> groups : truncate=200 boolean=G field=groups

Presumably there can be multiple groups?  Currently scriptindex doesn't
allow you to generate multiple booleans from one field.  The best fix
currently is probably for your conversion script to produce one entry
for all the group names (to go in the field) and also one entry per
group (to be indexed as boolean=G).

You may want to lowercase here if group names are case-insensitive.
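
A minimal sketch of that conversion step in Python, assuming the usual
scriptindex input format (one name=value line per field, records
separated by a blank line, and a newline inside a value continued with a
leading "=").  The field name "group" and the function names are made up
for illustration, and the approach relies on scriptindex applying the
index script to every instance of a field that is repeated in a record:

```python
def escape_value(value):
    # Assumption: scriptindex treats a line starting with "=" as the
    # continuation of the previous field's value.
    return value.replace("\n", "\n=")


def split_groups(groups_header):
    # Usenet "Newsgroups" headers are comma-separated.
    return [g.strip() for g in groups_header.split(",") if g.strip()]


def posting_to_record(posting):
    """Build one scriptindex input record from a dict describing a posting."""
    groups = split_groups(posting["groups"])
    lines = [
        "mid=" + escape_value(posting["mid"]),
        # One entry holding all the group names, to be stored in the field.
        "groups=" + escape_value(",".join(groups)),
    ]
    # One entry per group, each to be indexed as its own boolean term.
    lines.extend("group=" + escape_value(g) for g in groups)
    return "\n".join(lines)


if __name__ == "__main__":
    postings = [
        {"mid": "<abc@example.com>", "groups": "comp.lang.c, comp.lang.c++"},
        {"mid": "<def@example.com>", "groups": "news.answers"},
    ]
    # Records are separated by a blank line.
    print("\n\n".join(posting_to_record(p) for p in postings))
```

The matching index script lines would then be something like:

groups : truncate=200 field
group : lower boolean=G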

> date   : field date=unix field=date

You don't need both "field" and "field=date" - they mean the same thing,
since the default is to give the field in Xapian the same name as the
field in the input file.  Repeating it should be harmless though.
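
Note that "date=unix" expects the value to be a Unix timestamp (seconds
since 1970), so the conversion script has to produce that from the
posting's Date header.  A rough sketch in Python, assuming the header is
in the standard RFC 2822 style (the function name is just for
illustration, and real usenet postings may need more forgiving parsing):

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def date_header_to_unix(header_value):
    """Convert an RFC 2822 style Date header to seconds since the Unix epoch."""
    dt = parsedate_to_datetime(header_value)
    if dt.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC rather
        # than letting timestamp() assume the local time zone.
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


if __name__ == "__main__":
    # Emit the value in the form scriptindex reads for the "date" field.
    print("date=%d" % date_header_to_unix("Thu, 27 Apr 2006 13:02:49 +0100"))
```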

> body   : truncate=1000 weight=1 indexnopos field=body

Again, you probably want to index the untruncated field.  Also
"weight=1" is the default, so superfluous.  So I'd probably use:

body : indexnopos truncate=1000 field

Incidentally, if you're using 0.9.5, you should apply this patch to
get weights to work:

http://article.gmane.org/gmane.comp.search.xapian.general/2752

> The intention of the index is to search the postings and restrict the
> searches to date, groups or email addresses.

If you'd find it useful, you're welcome to a copy of the indexer we use
for gmane (http://search.gmane.org/), which has a similar purpose.

The intention is to release it under the GPL but I've not got around to
hammering the build system into packageable form.  But I can just tar up
the sources by hand for now.

Cheers,
    Olly


