[Snowball-discuss] New member questions

Olly Betts olly at survex.com
Wed Apr 23 04:59:58 BST 2025


Apologies for the slow reply.  I saw Martin had responded to you
before, but only just spotted there was a new question in your reply.

On Sat, Jan 25, 2025 at 11:27:07AM +0200, Harri Pasanen wrote:
> Btw, I ran into something I would think is a bug affecting individual
> English words: knotty gets stemmed to knotti, spotty to spotti.  But knot
> is a knot and spot is a spot.  I'm not sure whether playing whack-a-mole
> by filing an issue ticket on GitHub for each word is useful, though.

Specifically regarding "knotty" in English: while it can mean "with
lots of knots" ("A knotty plank of pine"), the more common usage is
probably a metaphorical one ("Stemming is a knotty problem") where
it means "difficult", and conflating it with "knot" is not helpful there.

In general it is OK for a stemmer not to conflate some words like this
(and inevitably there will be many cases where it doesn't, as languages
are irregular and evolve over time).  A stemmer which doesn't map "spotty"
and "spot" to the same stem can still usefully improve retrieval
performance overall, and the incremental difference handling one extra
word makes to retrieval is tiny.
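
To make the "spotty"/"spot" case concrete, here is a minimal Python
sketch that groups words by their stem so you can see which ones end up
conflated.  Note that toy_stem() is a made-up stand-in which only mimics
the final y -> i behaviour mentioned above plus a plural "s" rule; it is
not the Snowball English algorithm, and conflation_groups() is a
hypothetical helper name.

```python
from collections import defaultdict


def toy_stem(word):
    """Toy stand-in for a stemmer (an assumption, NOT Snowball's rules)."""
    # Mimic the reported behaviour: a final "y" becomes "i".
    if word.endswith("y") and len(word) > 2:
        return word[:-1] + "i"
    # Strip a simple plural "s" (but leave "ss" endings alone).
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def conflation_groups(words, stem):
    """Group words by stem; words in the same list are conflated."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


groups = conflation_groups(["spot", "spots", "spotty"], toy_stem)
# "spot" and "spots" share a stem; "spotty" ends up alone under "spotti".
print(groups)
```

Searching for "spots" would still find documents containing "spot" here;
only the link to "spotty" is missed.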

We sometimes add new rules if we can identify a significant class of
words that can be handled without causing problems.

The more problematic case is where the stemmer conflates words with
unrelated meanings.  For example, "communication", "communism" and
"community" would all stem together were it not for an exception added to
the stemmer many years ago.  We are likely to add an exception for such a
case even if the number of affected words is very small, because it
negatively affects the end-user search experience if irrelevant
documents are retrieved.
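
As a rough illustration of how an exception can stop this kind of
over-conflation, here is a toy Python sketch.  The suffix list, the
EXCEPTIONS table and both function names are assumptions made up for
this example; the real Snowball English stemmer uses different rules and
may well implement its exception differently.

```python
# Toy illustration only: NOT the Snowball English algorithm.
SUFFIXES = ("ication", "ism", "ity")  # hypothetical suffix list

# Hypothetical exception table: listed words bypass suffix stripping.
EXCEPTIONS = {
    "communism": "communism",
}


def naive_stem(word):
    """Strip the first matching suffix, with no exceptions applied."""
    for suffix in SUFFIXES:
        # Require a few characters to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Consult the exception table first, then fall back to naive_stem()."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return naive_stem(word)


for w in ("communication", "communism", "community"):
    print(w, naive_stem(w), stem(w))
```

Without the table all three words reduce to "commun".  With it,
"communism" keeps its own stem, so a search for "community" no longer
pulls in documents about communism.  Only one entry is shown, so the
other two words still conflate in this toy version.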

Cheers,
    Olly


