[Xapian-discuss] indexing for phrase searching and constructing queries

Mon Feb 5 22:59:37 GMT 2007

On Fri, Jan 26, 2007 at 08:06:14AM +0000, Richard Jolly wrote:
> I've changed my indexes. Previously I was doing two things I thought 
> would improve matching. First, I removed short words from the source 
> text.

Not including stopwords in the search typically improves retrieval, but
dropping them at index time can be a problem.  For example, it's then
not possible to search for the Shakespeare quote "to be or not to be",
the dance the "can can", the band "the the", or perhaps even the month
of May!  Stopwords are also important in some phrase searches.

Because stopwords are common, the posting lists and positional
information for them can be compressed very effectively, so the approach
I tend to recommend is to index them but not use them by default in
queries.  Then if a stopword is used in a phrase or preceded by "+", the
information is still there to search on.  The database will take up
more disk space, but the extra information won't actually be read
during a search unless it's needed.
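
As a rough sketch of that query-side policy (plain Python, not Xapian's
API; the stopword list and the plan_query() name are made up for
illustration), stopwords are dropped from a query unless they appear
inside a quoted phrase or are preceded by "+":

```python
import re

# Illustrative stopword list only; a real one would be longer.
STOPWORDS = {"a", "an", "and", "be", "not", "of", "or", "the", "to"}

# Matches a quoted phrase, or a single word optionally preceded by "+".
TOKEN_RE = re.compile(r'"([^"]*)"|(\+?)(\w+)')


def plan_query(query):
    """Return (phrases, terms) for a query string.

    Stopwords are kept inside quoted phrases and when preceded by "+";
    everywhere else they are dropped from the query.
    """
    phrases, terms = [], []
    for m in TOKEN_RE.finditer(query.lower()):
        phrase, plus, word = m.group(1), m.group(2), m.group(3)
        if phrase is not None:
            words = re.findall(r"\w+", phrase)
            if words:
                phrases.append(words)
        elif plus or word not in STOPWORDS:
            terms.append(word)
    return phrases, terms


print(plan_query('the band "the the"'))
# ([['the', 'the']], ['band'])
```

This only works because the index still contains the stopwords and
their positions; the filtering happens purely when the query is built.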

> Then I was lower casing and stemming each word in it - but added all 
> the combinations as posts. Basically I was trying to add as many 
> possible matches as I could.

Case-folding generally makes sense.  There are a few examples where it
is perhaps a problem, such as NeXT the defunct computer maker (though
"next computers" allows that to be found) and LaTeX the typesetting
software vs. latex rubber.  But people generally aren't consistent about
following the fussy capitalisation of such names anyway.

Adding both stemmed and unstemmed forms is probably useful.  Stemming
improves retrieval results in many cases, but it can conflate words
which some users want to distinguish, and proper nouns can be
problematic (e.g. "Tony" and "Toni" both stem to "toni").  Users may
be surprised to have phrase searches match stemmed forms of the words in
the phrase, though the number of "false" matches is probably very small
for most phrases.

But if you do add both stemmed and unstemmed forms, it's best to keep
them separate, say by adding a prefix to one or the other.  Otherwise
the frequencies of some terms will get skewed, which will probably
degrade the result rankings.  Omega's current approach is that the
unstemmed ("raw") form gets an "R" prefix, but it only adds these for
terms which are capitalised.
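
To illustrate the idea of keeping the two forms apart, here's a minimal
sketch in plain Python (not Omega's actual code).  The toy_stem()
function is a placeholder rather than a real stemming algorithm, the
index_terms() name is invented, and lower-casing the raw form before
prefixing it is my assumption:

```python
def toy_stem(word):
    """Placeholder stemmer (NOT a real algorithm): strips "ing" or "s"."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_terms(words, stem=toy_stem, raw_prefix="R"):
    """Return (term, position) pairs for a list of words.

    Every word yields a stemmed, lower-cased term.  A capitalised word
    also yields a prefixed unstemmed term at the same position, so the
    two forms never share a frequency count.
    """
    postings = []
    for pos, word in enumerate(words, start=1):
        lower = word.lower()
        postings.append((stem(lower), pos))
        if word[:1].isupper():
            # Assumption: the raw form is stored lower-cased after the prefix.
            postings.append((raw_prefix + lower, pos))
    return postings


print(index_terms(["Tony", "likes", "searching"]))
# [('tony', 1), ('Rtony', 1), ('like', 2), ('search', 3)]
```

Since both forms share a position, phrase searches can use either one.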

> Now I lower case all text, split on whitespace to form words, then 
> remove punctuation. I might put stemming back in, but I haven't yet. Is 
> there best practice for this, or common strategies?

Stemming is generally useful (though it's harder to do well in a mixed
language document collection).

> I'm particularly curious about punctuation.

I tend to think most punctuation should act as a word split, but there
are exceptions.  In English at least, apostrophes deserve special
handling.  A "." may do too (you might want to treat "I.B.M." the same
as IBM) and perhaps also "&" (e.g. AT&T).  Possibly "_" should just be
treated as a word character too (as it's allowable in identifiers in
most computer languages), but if you split on it instead, phrase
searching still allows whole identifiers to be found while also letting
their component words match.

You may also want to consider email addresses, URLs, and other
identifier syntaxes.  Omega's current approach is to rely on phrase
searching to provide these, which works fairly well.
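
A small sketch of those tokenising rules in plain Python (the regular
expressions and the tokenize() name are my own illustration, not what
Omega does internally): dotted initialisms are collapsed, apostrophes
and "&" are kept inside words, and everything else, including "_",
"@" and ".", splits words:

```python
import re

# Dotted initialisms such as "I.B.M." (two or more single letters, each
# followed by a dot), not embedded in a longer word or hostname.
INITIALISM_RE = re.compile(r"(?<![A-Za-z0-9.])(?:[A-Za-z]\.){2,}(?![A-Za-z0-9])")

# A word: letters/digits, optionally joined by internal "'" or "&".
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['&][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into lower-cased words using the rules above."""
    # Collapse initialisms first so "I.B.M." is indexed the same as "IBM".
    text = INITIALISM_RE.sub(lambda m: m.group(0).replace(".", ""), text)
    return [w.lower() for w in WORD_RE.findall(text)]


print(tokenize("I.B.M. and AT&T don't use foo_bar."))
# ['ibm', 'and', 'at&t', "don't", 'use', 'foo', 'bar']
print(tokenize("olly@example.com"))
# ['olly', 'example', 'com']
```

With positions recorded for each word, an email address or identifier
can then be found as a phrase of its parts, provided the query is
tokenised the same way.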

> I guess the general lesson is that whatever you do to the source text 
> to index it should also be done to the query entered.

Pretty much.  You need to make sure the two are handled in compatible
ways, though they needn't be identical (e.g. you might index stemmed
and unstemmed forms, but you needn't then search for both).

> >I take it you have a "name" box and a "text" box?  If so, you'd
> >ideally want to parse each separately using a QueryParser object with
> >one set to default to "name:", but currently I don't think you can
> >(I'll take a look in a week or so when I'm back from holiday as this
> >should be easy to do).

Try this patch (which applies with a little "fuzz" to 0.9.9):

http://oligarchy.co.uk/xapian/patches/xapian-queryparser-default-prefix.patch

This adds an extra optional parameter to QueryParser::parse_query().

Cheers,
    Olly