[Xapian-discuss] TermGenerator and SimpleStopper

Olly Betts olly at survex.com
Thu Jun 28 14:04:18 BST 2007


On Thu, Jun 28, 2007 at 11:34:06AM +0100, Tom Mortimer wrote:
> I'm using SimpleStopper with TermGenerator in a Python indexing
> script, in an attempt to keep my index size down (currently 30K per
> doc, and I have 200 million docs to index, which I think implies
> 6TB.)  However, unprefixed (positional?) terms are not affected by
> the stopper, though Z-prefixed terms are.
> 
> I assume this is intentional for phrase queries

Yes, that's exactly the idea.

> but I need to reduce my index size drastically. Is it possible to
> generate positional terms, filtered with a stoplist, and not generate
> the Z terms?  Or should I just write my own term generator?

There should probably be more configurability in TermGenerator, but
1.0 was already later than hoped, and more options mean more
combinations to test, so the current implementation is more
hard-wired than is ideal.

There's an option in the code for "hard stopping", but no exposed API
for it yet.  If you edit queryparser/termgenerator_internal.cc, change
stop_mode to STOPWORDS_IGNORE, and rebuild, then stop words won't be
indexed at all.

If you don't set a stemmer, you'll only get unstemmed terms, without
a Z prefix.  If you only want stemmed terms without a prefix, you'll
need to tweak the code, at least for now.
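
If you'd rather go the "write my own term generator" route than patch
the library, here is a minimal Python sketch of the idea.  It is not
what TermGenerator does internally: the names stopped_postings and
index_text are hypothetical, the tokeniser is a plain \w+ regex rather
than Xapian's word splitting, and no stemming is done.  It assumes only
that doc has an add_posting(term, position) method, as xapian.Document
does.

```python
import re


def stopped_postings(text, stopwords):
    """Return (term, position) pairs for the words not in the stoplist.

    Assumption: words are lowercased runs of \\w characters.  Positions
    start at 1 and count every word, stop words included, so a gap is
    left wherever a stop word was dropped.
    """
    postings = []
    words = re.findall(r"\w+", text.lower())
    for pos, word in enumerate(words, start=1):
        if word not in stopwords:
            postings.append((word, pos))
    return postings


def index_text(doc, text, stopwords):
    """Add unprefixed positional terms to doc, skipping stop words.

    doc is any object with add_posting(term, position), for example a
    xapian.Document.  No Z-prefixed terms are generated.
    """
    for term, pos in stopped_postings(text, stopwords):
        doc.add_posting(term, pos)
```

Keeping the original word positions, rather than renumbering after
filtering, means the remaining terms stay the same distance apart as in
the text; whether that is what you want for phrase searching is a
design choice.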

Cheers,
    Olly


