[Snowball-discuss] FW: [Xapian-discuss] Some performance questions

Arjen van der Meijden arjen@glas.its.tudelft.nl
Fri May 30 16:04:01 2003


> Martin Porter wrote:
> 
> My feeling is that they should 
> be indexed, for phrase searching and so on, but removed from 
> queries under certain circumstances and suppressed from term 
> expansion lists.
I think Xapian/Omega already does this? (Except, of course, that it uses
a hardcoded English stopword list instead of the Snowball stopword
list, which is a known issue anyway.)
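
A minimal sketch of that query-side behaviour, in Python. The stopword
set, the function name and the two rules (keep stopwords in phrase
queries, and never return an empty query) are assumptions for
illustration; I have not checked them against what Omega actually does.

```python
# Hypothetical sample list, not the Snowball or Omega stopword list.
STOPWORDS = {"the", "of", "and", "to", "a"}

def query_terms(words, is_phrase=False):
    """Drop stopwords from a plain query, but keep them in a phrase query.

    If every word is a stopword, keep them all rather than return nothing.
    """
    words = [w.lower() for w in words]
    if is_phrase:
        # Stopwords stay in the index, so phrase matching can use them.
        return words
    kept = [w for w in words if w not in STOPWORDS]
    return kept or words
```

For example, query_terms(["The", "history", "of", "Rome"]) keeps only
"history" and "rome", while the same call with is_phrase=True keeps all
four words.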

> Stopwords should be identified prior to stemming. Then a 
> distinction can be made between terms that are stopwords and 
> terms that stem to stopwords: they could give rise to 
> slightly different term variants.
This can become quite difficult, since the stopwords would still be
included in the index and thus the word count would still be skewed?
And if they are excluded from the indexes, that might break the entire
algorithm used in the search engine.
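
A minimal sketch of the ordering Martin describes: check the stopword
list first, then stem. The stopword set and toy_stem are hypothetical
stand-ins (not the Snowball list or the Snowball Dutch stemmer), chosen
only so that 'dane' stems to 'dan' as in my earlier example.

```python
# Hypothetical stand-ins, not the Snowball stopword list or stemmer.
STOPWORDS = {"dan", "de", "het", "een"}

def toy_stem(word):
    """Toy stemmer: strips one trailing 'e' from words longer than 3 letters."""
    return word[:-1] if len(word) > 3 and word.endswith("e") else word

def classify(word):
    """Check the stopword list before stemming, then stem.

    Returns (stem, kind), where kind is one of:
      'stopword'          - the surface form itself is a stopword
      'stems_to_stopword' - only the stem collides with a stopword
      'normal'            - neither
    """
    word = word.lower()
    if word in STOPWORDS:
        return toy_stem(word), "stopword"
    stem = toy_stem(word)
    if stem in STOPWORDS:
        return stem, "stems_to_stopword"
    return stem, "normal"
```

With the kind available, an indexer could give the two cases different
term variants, so that a real stopword and a word that merely stems to
one stay distinguishable in the index.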

> Another point is the splitting at hyphen. This is advisable 
> in English. What is best for Dutch?
I think it's correct. The hyphen is supposed to indicate a concatenation
of two terms into a single new term, used where the terms can't really
be written together because the joined word would get a different
meaning or be hard to pronounce. For example, "autoonderdeel" is
incorrect and should be written "auto-onderdeel" (car part).
There are always terms that should be indexed as a whole term, but I
can't think of examples where breaking them apart would really be a
bad thing. Perhaps "e-mail" ('email' is also correct in Dutch) is one,
but only because it's bad for performance ;)
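
A minimal sketch of splitting at the hyphen, in Python. Emitting the
joined form next to the parts is my own assumption (so that "e-mail"
can still match "email"), not something I know any indexer to do.

```python
def hyphen_terms(token):
    """Return index terms for a token, splitting at hyphens.

    Emits each part, and for hyphenated tokens also the parts joined
    without the hyphen (an assumption made for this sketch).
    """
    parts = [p for p in token.lower().split("-") if p]
    if len(parts) <= 1:
        return parts
    return parts + ["".join(parts)]
```

So "auto-onderdeel" yields "auto", "onderdeel" and the joined form,
while an unhyphenated word such as "autoradio" is returned unchanged.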

A more difficult issue with the Dutch language is that we concatenate
terms into a single word, where English just writes two separate
terms.
'stopword list' would in Dutch be 'stopwoordenlijst'; it's incorrect to
write "stopwoorden lijst". Likewise 'car radio' -> "autoradio", not
"auto-radio" or "auto radio".
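
To illustrate why this is harder than the hyphen case, here is a
minimal sketch of dictionary-based compound splitting in Python. The
lexicon and the function are hypothetical; a real lexicon would be far
larger, and this ignores the linking letters Dutch often puts between
the parts.

```python
# Hypothetical toy lexicon; a real one would be far larger.
LEXICON = {"auto", "radio", "stopwoorden", "lijst", "onderdeel"}

def split_compound(word, lexicon=LEXICON):
    """Split word into known lexicon parts, or return None if impossible.

    Tries the longest known prefix first and backtracks when the
    remainder cannot be split.
    """
    word = word.lower()
    if not word:
        return []
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None
```

With this lexicon "autoradio" splits into "auto" and "radio", and
"stopwoordenlijst" into "stopwoorden" and "lijst". A word with no known
parts gives None, so the caller can fall back to indexing it whole.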

But please note, I'm just a computer science student, not a
Dutch-language scientist/student :)

> What is needed for the other languages is 
> feedback on search failures caused by the stemmers. It is 
> important to realise that this feedback must be based on user 
> experience.
Well, the Dutch language will be a pain anyway, with words that have
different meanings depending on their context, or that are spelled
differently depending on their meaning (although the base/stem is the
same).

> You can't deduce these failures by scanning 
> weblogs, thinking about grammar, and the like. 
I know. The example ('dane' -> 'dan') I sent was just one that struck me
as wrong and is probably/hopefully one of the worst cases, since it's a
company name (which is in Dutch "bedrijfsnaam" ;) ) that gets stemmed
into a stopword.

If I have time, I'll try to gather more data on searches that really
fail because of stemming. Most of the time they'll fail because users
just don't know how to search effectively. When they use only one term
there isn't much "we" can do to help them, even if that term was
stemmed to a less specific term :)

Regards,

Arjen