[Snowball-discuss] FW: [Xapian-discuss] Some performance questions

Martin Porter martin_porter@SoftHome.net
Fri May 30 14:33:02 2003


Arjen,

Thanks for your email.

There is the issue of what the stemmer does, and how it is used. Stopwords
can be handled in a number of ways, depending on the IR model being adopted.
My feeling is that they should be indexed, for phrase searching and so on,
but removed from queries under certain circumstances and suppressed from
term expansion lists. (Admittedly this has not usually been the practice I
adopted when setting IR systems up for people ...) Stopwords should be
identified prior to stemming. Then a distinction can be made between terms
that are stopwords and terms that stem to stopwords: they could give rise to
slightly different term variants.
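
A minimal sketch of that ordering, in Python. Everything here is an
assumption made for illustration: the tiny stopword list, the toy_stem
function (a stand-in for a real stemmer, not the Snowball algorithm), and
the "stop:" / "stem:" prefixes used to keep the two kinds of term apart.

```python
# Hypothetical stopword list, chosen only to make the example small.
STOPWORDS = {"the", "of", "do", "to"}


def toy_stem(word):
    """Stand-in stemmer (NOT Snowball): strips a couple of suffixes."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def index_term(word):
    """Check for a stopword BEFORE stemming, then tag the result."""
    w = word.lower()
    if w in STOPWORDS:
        # A true stopword: indexed unstemmed, under its own prefix.
        return "stop:" + w
    # An ordinary word, which may still happen to stem to a stopword.
    return "stem:" + toy_stem(w)


def index_terms(words):
    """Index every word, stopwords included, so phrases can be matched."""
    return [index_term(w) for w in words]


def query_terms(words, phrase=False):
    """Keep stopwords in phrase queries, drop them otherwise."""
    terms = index_terms(words)
    if phrase:
        return terms
    return [t for t in terms if not t.startswith("stop:")]


print(index_term("do"))     # stop:do
print(index_term("doing"))  # stem:do
print(query_terms(["the", "doing", "of", "things"]))
```

Here "do" and "doing" end up as different terms even though the toy
stemmer reduces "doing" to "do", and the stopwords survive in the index
but are dropped from a non-phrase query.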

Another point is splitting words at hyphens. This is advisable in English.
What is best for Dutch?

As for what the stemmer does, only the English stemmer among the ones
presented on the Snowball site has been refined by experience. What is
needed for the other languages is feedback on search failures caused by the
stemmers. It is important to realise that this feedback must be based on
user experience. You can't deduce these failures by scanning web logs,
thinking about grammar, and the like.

Martin