[Xapian-discuss] UTF8 support plans (without stemming)

Olly Betts olly at survex.com
Thu Apr 28 20:06:39 BST 2005


On Thu, Apr 28, 2005 at 04:52:37PM +0100, Sam Liddicott wrote:
> Craig Macdonald wrote:
> >We've had some success in applying 
> >only the first two steps of the English (Porter) stemmer
> >to large English web corpora. Many submissions to last year's TREC 
> >Terabyte track didn't use stemming at all.
> >   http://www.google.co.uk/search?q=2004+trec+terabyte+stemming
> >It would also appear to be a similar approach to what Google is doing. 
> >The first two steps drop only plurals and tense suffixes.

Perhaps it would be useful to have a "porterlite" stemmer which
implemented this.
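
As a rough illustration, here is a minimal sketch of what such a stemmer
might look like. It assumes that "the first two steps" means steps 1a
(plurals) and 1b (-ed/-ing) of the original Porter algorithm. The name
porterlite and all helper names are hypothetical; none of this is part
of Xapian's API.

```python
VOWELS = "aeiou"


def _is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not _is_consonant(word, i - 1)
    return True


def _measure(stem):
    """Count vowel-to-consonant transitions (Porter's measure m)."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = _is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def _has_vowel(stem):
    return any(not _is_consonant(stem, i) for i in range(len(stem)))


def _ends_cvc(word):
    """True for consonant-vowel-consonant endings where the last letter is not w, x or y."""
    n = len(word)
    return (n >= 3
            and _is_consonant(word, n - 3)
            and not _is_consonant(word, n - 2)
            and _is_consonant(word, n - 1)
            and word[-1] not in "wxy")


def _step1a(word):
    """Strip plural suffixes."""
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def _tidy(stem):
    """Repair the stem after -ed/-ing has been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and _is_consonant(stem, len(stem) - 1)
            and stem[-1] not in "lsz"):
        return stem[:-1]
    if _measure(stem) == 1 and _ends_cvc(stem):
        return stem + "e"
    return stem


def _step1b(word):
    """Strip tense suffixes (-eed, -ed, -ing)."""
    if word.endswith("eed"):
        return word[:-1] if _measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return _tidy(stem) if _has_vowel(stem) else word
    return word


def porterlite(word):
    """Apply only Porter steps 1a and 1b to a lowercased word."""
    return _step1b(_step1a(word.lower()))
```

With this sketch, porterlite("documents") gives "document" and
porterlite("hopping") gives "hop", while derivational suffixes that the
later Porter steps would strip (e.g. "-ation", "-ness") are left alone.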

> When you are looking for enough hits in a near-infinite document set, the 
> drop in recall can be hidden: the user never knows what they miss out 
> on, as long as there are enough results, because they were never going 
> to look at all the good results anyway.

Indeed - if there are a lot of possible answers, precision matters much
more than recall.  Google appears to only automatically turn on stemming
(or synonyms) for "hard" queries, which makes some sense from this point
of view.

Cheers,
    Olly


