[Xapian-discuss] Autocomplete for phrases

Thu Dec 24 12:03:12 GMT 2009

On Mon, Dec 07, 2009 at 05:18:41PM +0000, Nick Griffiths wrote:
> For example, say we have a set of document titles, like so:
> 
> [1] - "welcome to the jungle",
> [2] - "welcome to thesaurus",
> [3] - "welcome all to the jungle",
> [4] - "all jungle"
> 
> When passing the following queries to parse_query with QUERY_PARTIAL, we
> are seeing...
> 
> w                     -> 1, 2, 3

You probably don't want to try to offer completions for a single character
query (just because there are generally too many possibile completions to
be able to do so usefully).

> welcome to the        -> 1, 2, *3
> jung                  -> 4, 1, 3
> all jungle            -> 4, *3, *1

I don't see how the last example can match 1 - is that a typo?

But otherwise this is as expected - what FLAG_PARTIAL does is to wildcard
expand the last term if it abuts the end of the query string.

> Which for our purposes is returning non-matching results (marked with an
> asterisk).  If we switch to using the QUERY_PHRASE flag, we avoid the
> extraneous results, but we lose the partial results.  Also, with a large
> database (4mil documents) the first query takes an incredibly long time.

There isn't a QUERY_PHRASE flag - there is FLAG_PHRASE (and you wrote
QUERY_PARTIAL above where you meant FLAG_PARTIAL), but that controls
support for quoted phrases so shouldn't change anything here.

Do you mean you are setting the default op to OP_PHRASE?  That's not going
to interact efficiently with FLAG_PARTIAL when the last term has many
possible completions.  Also it defaults to a window size such that the
words don't have to be adjacent (so `welcome jungle' will match 1 and 3).
If OP_PHRASE were improved to work more efficiently with subqueries which
weren't just terms this might be a viable approach - the window size could
be made configurable fairly easily.

> One idea is to add all possible incomplete versions of each term to the
> document's term list.  For the term 'welcome' we'd also add  'Sw',
> 'Swe', 'Swel' ... 'Swelcome'.  That way we can adjust the user's query
> to insert the prefix on the final word and then search via a standard
> phrase search.  Not sure what the performance or index size implications
> of this might be, but we'd be willing to have a shot if this is a good
> approach.

Modifying the user's query string before parsing isn't something we
recommend as you need to duplicate the behaviour of QueryParser to do
it safely.

It is likely to take a lot of space, as you are going to end up generating
a lot of terms and need to index them all with positions.

It's quite a lot like the wildcard acceleration idea here, but that would
have the advantages of note having to mess with query strings, and not
having to index every possible completion:

http://trac.xapian.org/ticket/207

Cheers,
    Olly