[Xapian-discuss] complete STOP PHRASES vs STOP WORDS

Olly Betts olly at survex.com
Thu Feb 13 05:54:50 GMT 2014


[A belated reply, but hopefully still useful....]

On Tue, Dec 03, 2013 at 02:35:27PM +0000, robin wrote:
> THE PROBLEM / TASK
> we send out a lot of batch letters where 90% of the content has no 
> significance to the real content which we may want to search. I know that
> there is the standard STOP WORD removal, so is there also a chance to have
> some kind of lookup table / dictionary / ... where you define complete
> phrases which are subtracted before the content is indexed by xapian?

TermGenerator doesn't currently support stopping phrases, though if you
tokenise the text yourself (instead of using TermGenerator) you can
handle it exactly as you want.

A patch to add stop phrases might be more widely useful, though the
details of how exactly it would work aren't clear to me - e.g. would
it just work on the tokenised words, or would punctuation and
capitalisation matter?  It also shouldn't impose an overhead if
you aren't using it.
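
For the do-it-yourself route, a minimal Python sketch might look like
the following.  It works purely on the tokenised words, so
capitalisation and punctuation are ignored, which is only one possible
answer to the question above.  The phrase list and the helper names
(STOP_PHRASES, tokenise, remove_stop_phrases) are made up for
illustration and aren't part of Xapian.

```python
import re

# Hypothetical stop phrases; in practice these would be taken from
# the fixed wording of your own letter templates.
STOP_PHRASES = [
    "thank you for your recent enquiry",
    "please do not hesitate to contact us",
]


def tokenise(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_phrases(text, phrases=STOP_PHRASES):
    """Return the tokens of `text` with every stop phrase occurrence removed."""
    tokens = tokenise(text)
    # Try longer phrases first, so a long phrase wins over a shorter
    # phrase which starts with the same words.
    phrase_tokens = sorted((tokenise(p) for p in phrases), key=len, reverse=True)
    kept = []
    i = 0
    while i < len(tokens):
        for p in phrase_tokens:
            if p and tokens[i:i + len(p)] == p:
                i += len(p)  # skip over the whole matched phrase
                break
        else:
            kept.append(tokens[i])
            i += 1
    return kept
```

The surviving tokens could then be added to a document yourself (for
example with Document's add_posting() method), or joined back into a
string and passed to TermGenerator's index_text().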

But I think in your case I wouldn't view the problem as stopping
phrases, but rather a process of de-templating, which is often better
done with the source document as it contains more information than
the extracted plain-text does.

You don't say what format these documents are in, but a similar case is
indexing a website with common navigation, legal statements, etc on each
page.  There, things like the HTML structure and/or CSS classes can
provide a simple way to pick out the text you want to index.
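
As a sketch of that kind of de-templating, assuming the source is
reasonably well-formed HTML and the interesting text sits inside an
element with a known CSS class, something like this would pull out just
that text using only Python's standard library.  The class name
"letter-body" and the names ContentExtractor and extract_content are
hypothetical, chosen for the example.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not change the
# nesting depth we track below.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentExtractor(HTMLParser):
    """Collect the text found inside elements carrying a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0    # nesting depth inside the wanted element; 0 = outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_content(html, wanted_class="letter-body"):
    """Return the whitespace-normalised text inside the wanted element(s)."""
    parser = ContentExtractor(wanted_class)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

The depth counting relies on tags being properly closed, so messy
real-world HTML may need a more forgiving parser.  The extracted text
can then be indexed as usual.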

Cheers,
    Olly


