[Snowball-discuss] A New Stemmer for Pali

Olly Betts olly at survex.com
Mon Apr 29 01:07:07 BST 2024


On Sun, Apr 28, 2024 at 05:11:58PM +0700, Khemarato Bhikkhu wrote:
> I'm a volunteer for SuttaCentral.net working on improving their search.
> It's currently using ArangoDB, so I thought a good first step might be to
> teach ArangoDB to natively "understand" Pali by adding a Pali stemmer to
> Snowball.
> 
> Here's my first stab at it:
> https://github.com/snowballstem/snowball/pull/197
> 
> Any and all feedback would be greatly appreciated.  I'm especially curious
> to know if Snowball supports separating compound words (by adding a space
> between components?)

No, with some caveats.

Snowball, the string processing language, could do that if the splitting
can be done algorithmically, but the current stemmer API framework takes
a word and returns a single stem.  (There is a Latin stemmer example
which returns both a noun-stem and a verb-stem, but that doesn't fit
this framework, so it isn't actually included as a stemmer that can be
used directly.)

The special case where part of a compound word isn't useful to index in
itself (for example, because it's a stop word) could be handled by
removing it, effectively handling it like a prefix, suffix or infix
even if it's not strictly speaking one of those.
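
As a rough illustration of that special case, here is a minimal Python
sketch (not Snowball code, and not taken from the PR).  The component
list and the minimum remainder length are hypothetical placeholders, not
real Pali data; the point is only that a fixed set of uninteresting
components can be stripped the same way a prefix would be.

```python
# Hypothetical components that are not worth indexing on their own.
# "foo" is a placeholder, not a real Pali word.
REMOVABLE_COMPONENTS = ("foo",)

# Assumed safeguard: never strip if fewer than this many characters remain.
MIN_REMAINDER = 3


def strip_leading_component(word: str) -> str:
    """Remove a known leading component, treating it like a prefix."""
    for component in REMOVABLE_COMPONENTS:
        remainder = word[len(component):]
        if word.startswith(component) and len(remainder) >= MIN_REMAINDER:
            return remainder
    return word
```

With these placeholders, "foobarbaz" becomes "barbaz", while "foo" on its
own and "barbaz" are left unchanged.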

ICU might be a better place to look at handling compound words.  It
supports word boundary identification, but it doesn't seem to have
specific rules or a dictionary for Pali currently:

https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr/rules
https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr/dictionaries
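
To show how a separate splitting step could sit in front of a
one-word-in, one-stem-out stemmer, here is a minimal Python sketch.  It
is an assumption-laden toy: the greedy longest-match splitter, the
English placeholder dictionary and toy_stem are all hypothetical, and
none of it is ICU's or Snowball's actual code.

```python
def split_compound(word, dictionary):
    """Greedy longest-match split against a word list.

    Returns [word] unchanged if the whole word cannot be covered.
    """
    parts = []
    pos = 0
    while pos < len(word):
        # Try the longest candidate starting at pos first.
        for end in range(len(word), pos, -1):
            if word[pos:end] in dictionary:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return [word]  # no dictionary entry fits here; give up
    return parts


def toy_stem(word):
    """Stand-in for a real stemmer: one word in, one stem out."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def index_terms(word, dictionary):
    """Split first, then stem each component separately."""
    return [toy_stem(part) for part in split_compound(word, dictionary)]


# Placeholder dictionary (English, purely for illustration).
TOY_DICTIONARY = {"sun", "flower", "seeds"}
```

Here index_terms("sunflowerseeds", TOY_DICTIONARY) gives
["sun", "flower", "seed"], and a word the dictionary cannot fully cover,
such as "sunshine", is passed to the stemmer whole.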

> and also how polished an algorithm should be to get
> checked in.  Do you want the algorithms to be polished and stable before
> they get merged, or do you support a process of more continuous improvement?

We really want them to be polished and stable before merging.  Changes
to a released algorithm are potentially disruptive, as they can require
a full reindex of any search system using it: the stemming used at index
time and at search time needs to match, or else results go missing or
wrong results may even be returned.  Therefore we are generally
conservative about changing existing algorithms.
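
The index-time/search-time mismatch can be sketched in a few lines of
Python.  Everything here is a toy under stated assumptions: stem_v1 and
stem_v2 are made-up stand-ins for two releases of the same stemmer, and
the inverted index is a plain dict rather than any real search engine.

```python
from collections import defaultdict


def stem_v1(word):
    """Toy 'old release': strips a trailing "s" only."""
    return word[:-1] if word.endswith("s") else word


def stem_v2(word):
    """Toy 'new release': additionally strips a trailing "es"."""
    if word.endswith("es"):
        return word[:-2]
    return word[:-1] if word.endswith("s") else word


def build_index(docs, stem):
    """Map each stem to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index


def search(index, query, stem):
    """Look up a one-word query using the given stemmer."""
    return index.get(stem(query.lower()), set())


docs = {1: "boxes", 2: "box"}

old_index = build_index(docs, stem_v1)            # built before the change
mismatched = search(old_index, "boxes", stem_v2)  # new stemmer, old index
reindexed = search(build_index(docs, stem_v2), "boxes", stem_v2)
```

In this toy, mismatched is {2}: document 1, which literally contains
"boxes", is no longer found, because it was indexed under the old stem
"boxe".  After a full reindex with stem_v2, the same query returns
{1, 2}.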

Opening a PR before this point for feedback is fine though.  I'll take
a look when I get a chance.

Cheers,
    Olly


