[Xapian-discuss] TermGenerator and SimpleStopper
olly at survex.com
Thu Jun 28 14:04:18 BST 2007
On Thu, Jun 28, 2007 at 11:34:06AM +0100, Tom Mortimer wrote:
> I'm using SimpleStopper with TermGenerator in a Python indexing
> script, in an attempt to keep my index size down (currently 30K per
> doc, and I have 200 million docs to index, which I think implies
> 6TB.) However, unprefixed (positional?) terms are not affected by
> the stopper, though Z-prefixed terms are.
> I assume this is intentional for phrase queries
Yes, that's exactly the idea.
> but I need to reduce my index size drastically. Is it possible to
> generate positional terms, filtered with a stoplist, and not generate
> the Z terms? Or should I just write my own term generator?
There should probably be more configurability in TermGenerator, but
1.0 was already later than hoped, and more options means more
combinations to test, so the current implementation is probably more
hard-wired than is ideal.
There's an option in the code for "hard stopping", but no exposed API
for it yet. If you edit queryparser/termgenerator_internal.cc and
change stop_mode to STOPWORDS_IGNORE then stop words won't be indexed.
If you don't set a stemmer, you'll only get unstemmed terms, without
a Z prefix. If you only want stemmed terms without a prefix, you'll
need to tweak the code, at least for now.
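If you do end up writing your own term generator, the following is a minimal sketch of the idea in Python, not what TermGenerator does internally. It assumes a simple regex word splitter and lowercasing, and it emits unprefixed positional terms with stop words dropped and no Z terms. The names index_text and WORD_RE are made up for illustration. Stopped words still consume a position, which is a design choice in this sketch so that phrases don't match across a removed word.

```python
import re

# Hypothetical tokenizer: runs of word characters. Xapian's own word
# splitting rules are more involved than this.
WORD_RE = re.compile(r"\w+", re.UNICODE)


def index_text(text, stopwords, doc=None, start_pos=0):
    """Return [(term, position), ...] for text, skipping stop words.

    If doc is given, it is expected to provide add_posting(term, pos),
    as xapian.Document does; each kept term is also added to it.
    """
    postings = []
    pos = start_pos
    for match in WORD_RE.finditer(text):
        term = match.group(0).lower()
        pos += 1  # every word takes a position, stopped or not
        if term in stopwords:
            continue  # "hard stop": the word is not indexed at all
        postings.append((term, pos))
        if doc is not None:
            doc.add_posting(term, pos)
    return postings


postings = index_text(
    "The quick brown fox jumps over the lazy dog",
    stopwords={"the", "over"},
)
```

Passing a xapian.Document as doc should work, since Document has an add_posting(term, position) method. Terms indexed this way are unstemmed, so queries would need to be parsed without stemming to match them.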