[Snowball-discuss] Malay vs Indonesian stemming

Olly Betts olly at survex.com
Tue Jan 13 20:32:38 GMT 2026


I'm wondering if we should advertise our current Indonesian stemmer
as being suitable for both Malay and Indonesian.

Some internet research and a bit of testing make me suspect we should,
but I'd really like input from someone more familiar with these
languages.  (While I implemented the Snowball Indonesian stemmer, it's
almost entirely based on a paper.  I didn't develop the algorithm,
though I did have to resolve some ambiguities in the algorithm
description in the paper, and my knowledge of the language mostly comes
from that.)

We've had at least one request for a Malay stemmer, although it was over
20 years ago (the original requester's address seems to be missing, but
probably a student's email address from 2004 would no longer work anyway):
https://lists.tartarus.org/pipermail/snowball-discuss/2004-August/000653.html

I'll briefly lay out the evidence I've found:

* The paper we're implementing the algorithm from ("A Study of Stemming
  Effects on Information Retrieval in Bahasa Indonesia" by Fadillah Z
  Tala) notes for example "We also looked at the morphological structure
  of Malay language words [1], since Bahasa Indonesia is very similar to
  Malay language".

* https://en.wikipedia.org/wiki/Comparison_of_Indonesian_and_Standard_Malay
  describes differences, but none seem likely to negatively affect the
  effectiveness of a stemming algorithm designed for one language at
  stemming the other.  (Differences in vocabulary and especially in loan
  word origin potentially might, for example if many loan words only
  used in one language are misstemmed by rules designed for the other -
  that's hard for me to judge.)

* https://en.wikipedia.org/wiki/Malay_grammar#Affixes suggests the same
  affixes are seen in both languages.

* Testing the stemmer with Malay text seems to give similarly good results.
  For example, trying the examples given in the old request for a Malay
  stemmer:

$ printf '%s\n' makanan pemakan dimakan pemakanan termakan makan|./stemwords -l indonesian -p2
makanan                       makan
pemakan                       pakan
dimakan                       makan
pemakanan                     pakan
termakan                      makan
makan                         makan

  (The handling of "pemakan" and "pemakanan" is not ideal, but all these
  words also seem to be Indonesian words with the same meanings as in
  Malay, so this is not specific to using the stemmer for Malay.)
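
For anyone wanting to run a longer Malay word list through the stemmer,
grouping the output by stem makes over- and under-conflation easier to
spot than reading the two columns line by line.  Here is a minimal
sketch, assuming a POSIX shell and awk.  The name group_by_stem is just
a hypothetical helper for illustration, not part of Snowball, and the
sample input is the stemwords output quoted above.

```shell
# Hypothetical helper (not part of Snowball): read "word stem" pairs as
# printed by `stemwords -p2` on stdin and print one line per stem,
# listing the words that were conflated to it.
group_by_stem() {
    awk '
        NF == 2 {
            # Append the word to the list already collected for this stem.
            if ($2 in words) words[$2] = words[$2] " " $1
            else words[$2] = $1
        }
        END { for (s in words) print s ": " words[s] }
    ' | sort
}

# Sample input: the output from the test run quoted above.
group_by_stem <<'EOF'
makanan                       makan
pemakan                       pakan
dimakan                       makan
pemakanan                     pakan
termakan                      makan
makan                         makan
EOF
# Prints:
# makan: makanan dimakan termakan makan
# pakan: pemakan pemakanan
```

With a stemwords binary to hand, the same function can be put on the
end of the pipeline shown above in place of the here-document.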

Thoughts?  If you want to try the stemmer out, the easiest way is
probably our online JavaScript demo:

https://snowballstem.org/demo.html#Indonesian

Cheers,
    Olly


