From olly at survex.com Tue Jan 13 20:32:38 2026
From: olly at survex.com (Olly Betts)
Date: Wed, 14 Jan 2026 09:32:38 +1300
Subject: [Snowball-discuss] Malay vs Indonesian stemming
Message-ID:

I'm wondering if we should advertise our current Indonesian stemmer as
being suitable for both Malay and Indonesian.

Some internet research and a bit of testing make me suspect we should,
but I'd really like input from someone more familiar with these
languages.

(While I implemented the Snowball Indonesian stemmer, it's almost
entirely based on a paper. I didn't develop the algorithm, though I
did have to resolve some ambiguities in the algorithm description in
the paper, and my knowledge of the language mostly comes from that.)

We've had at least one request for a Malay stemmer, although it was
over 20 years ago (the original requester's address seems to be
missing, but a student's email address from 2004 would probably no
longer work anyway):

https://lists.tartarus.org/pipermail/snowball-discuss/2004-August/000653.html

I'll briefly lay out the evidence I've found:

* The paper we're implementing the algorithm from ("A Study of Stemming
  Effects on Information Retrieval in Bahasa Indonesia" by Fadillah Z
  Tala) notes for example "We also looked at the morphological
  structure of Malay language words [1], since Bahasa Indonesia is very
  similar to Malay language".

* https://en.wikipedia.org/wiki/Comparison_of_Indonesian_and_Standard_Malay
  describes differences, but none seem likely to make a stemming
  algorithm designed for one language less effective at stemming the
  other. (Differences in vocabulary, and especially in loan word
  origin, potentially might, for example if many loan words only used
  in one language are misstemmed by rules designed for the other.
  That's hard for me to judge.)

* https://en.wikipedia.org/wiki/Malay_grammar#Affixes suggests the same
  affixes are seen in both languages.

* Testing the stemmer with Malay text seems to give similarly good
  results.
  For example, trying the examples given in the old request for a
  Malay stemmer:

  $ printf '%s\n' makanan pemakan dimakan pemakanan termakan makan|./stemwords -l indonesian -p2
  makanan    makan
  pemakan    pakan
  dimakan    makan
  pemakanan  pakan
  termakan   makan
  makan      makan

  (The handling of "pemakan" and "pemakanan" is not ideal, but all
  these words also seem to be Indonesian words with the same meanings
  as in Malay, so this is not specific to using the stemmer for Malay.)

Thoughts?

If you want to try the stemmer out, the easiest way is probably our
online JavaScript demo:

https://snowballstem.org/demo.html#Indonesian

Cheers,
    Olly