[Xapian-discuss] UTF8 support plans (without stemming)
Sam Liddicott
sam at liddicott.com
Thu Apr 28 16:52:37 BST 2005
Craig Macdonald wrote:
>> Well, these two questions relate to each other: Xapian is strong in
>> 'probabilistic IR' and that approach kind of needs some sort of
>> stemming.
>
> I don't totally agree with that. We've had some success in applying
> only the first two steps of the English (Porter) stemmer
> to large English web corpora. Many submissions to last year's TREC
> Terabyte track didn't use stemming at all.
> http://www.google.co.uk/search?q=2004+trec+terabyte+stemming
> It would also appear to be a similar approach to what Google is doing.
> The first two steps only drop plurals and tense suffixes.
>
When you only need enough hits from a near-infinite document set, the
drop in recall can be hidden: as long as there are enough results, the
user never knows what they missed, because they were never going to
look at all the good results anyway.
In a smaller document set, or where the user knows what results to
expect (sometimes the same thing), this can become very annoying.
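A toy sketch of the effect, for illustration only. This is not Xapian
code: strip_suffix is a made-up stand-in for plural/tense stripping
(not the real Porter steps), and the four documents are invented.

```python
def strip_suffix(word):
    """Hypothetical toy suffix stripper; NOT the real Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def search(docs, query, normalise=lambda w: w):
    """Return indices of docs containing the (normalised) query term."""
    q = normalise(query.lower())
    return [
        i
        for i, text in enumerate(docs)
        if q in {normalise(w) for w in text.lower().split()}
    ]


# Invented mini-collection: three documents are about indexing.
docs = [
    "indexing large collections",
    "the index was rebuilt",
    "indexed documents only",
    "unrelated text",
]

exact = search(docs, "index")                  # no stemming
stemmed = search(docs, "index", strip_suffix)  # with toy stripping

print("without stemming:", exact)
print("with stemming:   ", stemmed)
```

Here exact matching finds one of the three relevant documents. In a
collection this small the two misses are obvious; on the web there
would be thousands of other exact matches to hide them.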
Sam