[Xapian-discuss] Can stemmed and non-stemmed terms mixed in one query?

Mon Sep 12 02:00:39 BST 2005

On Fri, Sep 09, 2005 at 06:25:56PM +0700, Ronny Perdana wrote:
> I just installed 0.9.2 (on Cygwin, with new safewindows.h), is it possible 
> to mix stemmed and non-stemmed terms in a query? For example, the query:
> "transport" +rule
> expects to find documents that contain the word transport (not 
> transportation, transporting, etc) and any form of the word rule (rule, 
> ruling, rules, etc)
> 
> The example files show how to index documents either stemmed or not-stemmed, 
> but I cannot find a way to mix the two. 

The approach Omega takes is to generate terms from unstemmed words by
lowercasing and prefixing with "R" (in addition to generating stemmed
terms).  It only does this for capitalised words though.
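
As a rough illustration of that indexing scheme (this is not Omega's
actual code; toy_stem() is a hypothetical stand-in for a real stemmer,
and index_terms() is a name made up for this sketch):

```python
import re

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ation", "ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Return the terms generated for a piece of document text."""
    terms = []
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        # Every word produces a stemmed term.
        terms.append(toy_stem(lower))
        # Capitalised words additionally produce an unstemmed,
        # lowercased term with an "R" prefix.
        if word[0].isupper():
            terms.append("R" + lower)
    return terms

print(index_terms("Transport rules"))
```

Note that a lowercase "transport" in a document produces no R term at
all under this scheme, only the stemmed one.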

QueryParser similarly handles capitalised words by turning them into "R"
prefixed terms.  This ought to be configurable (so that you can use
QueryParser with your own indexer if you don't want to generate R terms)
but that's not been done yet.
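
The query-side mapping can be sketched in the same way.  Again this is
only an illustration under assumptions, not QueryParser's real
implementation: toy_stem() and query_term() are hypothetical names.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ation", "ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def query_term(word):
    """Map one query word to the index term it should be looked up as."""
    lower = word.lower()
    if word[:1].isupper():
        # Capitalised: search for the unstemmed "R" prefixed term.
        return "R" + lower
    # Otherwise: search for the stemmed term.
    return toy_stem(lower)

for w in ("Transport", "transporting", "rules"):
    print(w, "->", query_term(w))
```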

QueryParser doesn't treat quoted phrases as meaning unstemmed.  To be
honest, the "capitalised words are unstemmed" scheme isn't ideal - it's
good for proper nouns (names and places) which stemming can cause
problems with, but sometimes a query contains capitalised words for
no particular reason - e.g. if someone pastes a phrase from a document.

But treating quoted words as unstemmed would be more problematic -
since R terms are only generated for capitalised words at present,
"transport" would then only match "Transport" in documents.

The alternatives which come to mind are to generate R terms for all
words (which is going to greatly increase the size of the database) or
to not stem any terms in the index and perform the stemming work at
search time.  For the latter, you effectively need to stem each search
term, then "unstem" it to the list of possible forms and combine these
with an OR-like query operator, but with appropriate handling of term
frequency and wdf (within-document frequency).
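
A minimal sketch of the search-time idea, assuming a toy in-memory
index of unstemmed terms.  Everything here is hypothetical (toy_stem(),
build_index(), unstem() and search() are names invented for the
example), and summing the wdf of all matching forms per document is
just one plausible way to treat the forms as a single term:

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ation", "ing", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each unstemmed, lowercased word to {doc_id: wdf}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index

def unstem(index, word):
    """List the indexed forms which share a stem with the search word."""
    stem = toy_stem(word.lower())
    return sorted(term for term in index if toy_stem(term) == stem)

def search(index, word, stemmed=True):
    """Return {doc_id: combined wdf}, OR-ing together all forms."""
    forms = unstem(index, word) if stemmed else [word.lower()]
    combined = defaultdict(int)
    for form in forms:
        for doc_id, wdf in index.get(form, {}).items():
            # Assumption: treat the forms as one term by summing wdf.
            combined[doc_id] += wdf
    return dict(combined)

docs = {
    1: "transport rules",
    2: "transportation ruling",
    3: "transporting rule rules",
}
index = build_index(docs)
print(unstem(index, "rules"))
print(search(index, "transport", stemmed=False))
print(search(index, "rule"))
```

With this layout an exact-form search and a stemmed search can be mixed
in one query, since the index itself holds only unstemmed terms.  The
term frequency of the combined term would be the number of documents in
the result, rather than the sum of the individual term frequencies.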

This second approach probably has merit, and I'm already intending to
investigate it.

Cheers,
    Olly