[Snowball-discuss] Malay vs Indonesian stemming
Olly Betts
olly at survex.com
Tue Jan 13 20:32:38 GMT 2026
I'm wondering if we should advertise our current Indonesian stemmer
as being suitable for both Malay and Indonesian.
Some internet research and a bit of testing make me suspect we should,
but I'd really like input from someone more familiar with these
languages. (While I implemented the Snowball Indonesian stemmer, it's
almost entirely based on a paper. I didn't develop the algorithm,
though I did have to resolve some ambiguities in the algorithm
description in the paper, and my knowledge of the language mostly comes
from that.)
We've had at least one request for a Malay stemmer, although it was over
20 years ago (the original requester's address seems to be missing, but
a student's email address from 2004 probably wouldn't still work anyway):
https://lists.tartarus.org/pipermail/snowball-discuss/2004-August/000653.html
I'll briefly lay out the evidence I've found:
* The paper we're implementing the algorithm from ("A Study of Stemming
Effects on Information Retrieval in Bahasa Indonesia" by Fadillah Z
Tala) notes for example "We also looked at the morphological structure
of Malay language words [1], since Bahasa Indonesia is very similar to
Malay language".
* https://en.wikipedia.org/wiki/Comparison_of_Indonesian_and_Standard_Malay
describes differences, but none seem likely to negatively affect the
effectiveness of a stemming algorithm designed for one language at
stemming the other. (Differences in vocabulary and especially in loan
word origin potentially might, for example if many loan words only
used in one language are misstemmed by rules designed for the other -
that's hard for me to judge.)
* https://en.wikipedia.org/wiki/Malay_grammar#Affixes suggests the same
affixes are seen in both languages.
* Testing the stemmer with Malay text seems to give similarly good results.
For example, trying the examples given in the old request for a Malay
stemmer:
    $ printf '%s\n' makanan pemakan dimakan pemakanan termakan makan | ./stemwords -l indonesian -p2
    makanan makan
    pemakan pakan
    dimakan makan
    pemakanan pakan
    termakan makan
    makan makan
(The handling of "pemakan" and "pemakanan" is not ideal, but these
words all seem to be Indonesian words too, with the same meanings as
in Malay, so this is not specific to using the stemmer for Malay.)
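To make the split easier to see, here is a minimal sketch that groups the
input words by the stem they were given. It doesn't run the stemmer itself:
the word/stem pairs are just copied from the stemwords output above, and the
name group_by_stem is made up for this illustration.

```python
from collections import defaultdict

# Word/stem pairs copied verbatim from the stemwords output above
# (this is recorded output, not a live call to the stemmer).
STEMWORDS_OUTPUT = """\
makanan makan
pemakan pakan
dimakan makan
pemakanan pakan
termakan makan
makan makan
"""


def group_by_stem(output):
    """Map each stem to the list of input words that produced it.

    Hypothetical helper for this illustration; expects one
    "word stem" pair per line, separated by whitespace.
    """
    groups = defaultdict(list)
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, stem = line.split()
        groups[stem].append(word)
    return dict(groups)


groups = group_by_stem(STEMWORDS_OUTPUT)
for stem, words in sorted(groups.items()):
    print(f"{stem}: {', '.join(words)}")
```

This gives two groups: "makan" (makanan, dimakan, termakan, makan) and
"pakan" (pemakan, pemakanan). Presumably all six forms would ideally end up
in a single group, which is the sense in which the handling of the last two
is not ideal.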
Thoughts? If you want to try the stemmer out, the easiest way is
probably our online JavaScript demo:
https://snowballstem.org/demo.html#Indonesian
Cheers,
Olly