[Xapian-discuss] Stopper Problems

Olly Betts olly at survex.com
Thu Mar 8 15:51:24 GMT 2007


Please keep discussion on the lists - others may be interested too!

On Thu, Mar 08, 2007 at 09:09:21AM +0000, Colin Bell wrote:
> Thanks Olly, as always, I really appreciate the time you take to
> answer these questions. I couldn't post the whole Stopper because the
> mailing list keeps putting it on hold because it says the body of the
> message is too long. The stopper just contains more words for each
> letter of the alphabet.

I was really just indicating why I hadn't actually tried the code, but a
cut-down example or a URL for the example would have worked.

> It turns out that it was the punctuation which was causing some of
> the problems: either a word had a comma after it, or the word had an
> apostrophe in it.

In the stopword list or the query?

Currently (in 0.9.x) apostrophes are treated as "phrase generators", so
<doesn't> is the same as <doesn-t>.  This is really a misfeature, as
it produces phrase searches where we don't need them (and some of the
slower cases of phrase searches too) and it's not useful to be able to
search for the two parts separately, except in one case - the possessive
<'s> (e.g. <Olly's>) which is better handled by the stemmers (and the
latest Snowball English stemmer does this).  In SVN trunk, an apostrophe
between two word characters is included in the term.

So currently your stopper will never get passed <doesn't>; it will
be offered <doesn> and <t> instead.  But a query string containing
a word with a comma after it should work as expected.  For example, the
query string "the, comma" should cause <the> and <comma> to be passed to
the stopper.
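
To make the splitting concrete, here is a toy sketch of the behaviour
described above.  It is NOT Xapian's query parser: split_terms() and its
keep_inner_apostrophe flag are hypothetical names invented for this
illustration, and it only mimics which strings a stopper would be offered.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Toy splitter (an illustration only, not Xapian code).
// keep_inner_apostrophe == false mimics the 0.9.x behaviour described
// above, where an apostrophe ends the current term.
// keep_inner_apostrophe == true mimics the trunk behaviour, where an
// apostrophe between two word characters stays inside the term.
std::vector<std::string> split_terms(const std::string& query,
                                     bool keep_inner_apostrophe) {
    std::vector<std::string> terms;
    std::string current;
    for (std::size_t i = 0; i < query.size(); ++i) {
        unsigned char ch = static_cast<unsigned char>(query[i]);
        // An apostrophe counts as part of the term only when it has a
        // word character on both sides.
        bool inner_apostrophe =
            keep_inner_apostrophe && ch == '\'' && !current.empty() &&
            i + 1 < query.size() &&
            std::isalnum(static_cast<unsigned char>(query[i + 1]));
        if (std::isalnum(ch) || inner_apostrophe) {
            current += query[i];
        } else if (!current.empty()) {
            // Commas, spaces and other punctuation end the current term.
            terms.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) terms.push_back(current);
    return terms;
}
```

With the flag off, "doesn't" yields <doesn> and <t>, and "the, comma"
yields <the> and <comma>.  So a 0.9.x stopword list would need entries
for the fragments rather than for <doesn't> itself.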

> Is there any way to adjust my stopper to stop any terms shorter than 3
> chars?

Just insert this before any other checks (assuming you mean strictly
shorter of course):

    if (t.size() < 3) return true;
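
For context, here is a minimal sketch of where that line might sit.  I
haven't seen your full stopper, so the class name and the std::set of
stopwords are assumptions; the class is written standalone so it
compiles without the Xapian headers.  In a real program it would derive
from Xapian::Stopper and override the same operator().

```cpp
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for a Xapian::Stopper subclass.
class ShortTermStopper {
    std::set<std::string> stopwords_;

  public:
    explicit ShortTermStopper(std::set<std::string> words)
        : stopwords_(std::move(words)) {}

    // Returns true if term t should be treated as a stopword.
    bool operator()(const std::string& t) const {
        // Length check first: anything strictly shorter than 3
        // characters is stopped regardless of the word list.
        if (t.size() < 3) return true;
        // Otherwise fall back to the explicit stopword list.
        return stopwords_.count(t) != 0;
    }
};
```

Note that size() counts bytes, so a term containing multi-byte UTF-8
characters can have fewer than 3 characters and still pass the check.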

Cheers,
    Olly


