[Snowball-discuss] A stemmer request

Ivan Voras ivoras at gmail.com
Thu May 15 00:03:24 BST 2014


Hello,

I'm not a linguist, just a programmer who wishes to have a stemming
library for the Croatian language available for my needs. The Snowball
library is used in several (open source) projects I'm working with, so
I'm looking for a way to extend it to support my language - eventually
those projects should pick it up.

Some time ago I came across a table of word-suffix replacements which
does a pretty good job on the language - certainly not perfect, but
"good enough" and infinitely better than nothing. The table can be
seen at:

http://goo.gl/XoZfD2

This particular file contains regex substitutions: if a word ends
with the characters on the left side, that whole sequence is replaced
by the expression on the right side, which starts with a regex
backreference. Please ask if this is unclear. This is a single-pass
operation. Note that Unicode support is needed.
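
As a rough illustration, a single-pass substitution of this kind could
be sketched in Python as below. The rules shown are invented examples
and are not entries from the table; the names RULES and stem, and the
exact rule layout (a suffix pattern with one capture group on the left,
a replacement starting with \1 on the right), are assumptions made for
the sketch.

```python
import re

# Hypothetical rules for illustration only; NOT copied from the linked table.
# Left side: suffix pattern with one capture group.
# Right side: replacement that starts with a backreference to that group.
RULES = [
    ("(g)ama", r"\1a"),   # e.g. "knjigama"  -> "knjiga"
    ("(ć)ima", r"\1"),    # e.g. "mladićima" -> "mladić"
]

# Anchor every pattern at the end of the word. Python 3 str patterns are
# Unicode-aware by default, so letters such as "ć" need no special handling.
COMPILED = [(re.compile(left + "$"), right) for left, right in RULES]


def stem(word: str) -> str:
    """Apply the first matching suffix rule once; otherwise return the word unchanged."""
    for pattern, replacement in COMPILED:
        if pattern.search(word):
            # Single pass: stop after the first rule that matches.
            return pattern.sub(replacement, word, count=1)
    return word
```

Because only the first matching rule is applied, the order of the rules
matters in this sketch; putting longer suffixes before shorter ones is
one reasonable choice.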

I don't claim this is the most efficient or the right way to do it,
only that I've tried this table in practice and it works well enough.

I've looked at the Snowball stemmers available for download at
snowball.tartarus.org, but the sbl language is... pretty idiosyncratic :)

Can anyone help me by providing skeleton code or documentation which
I could use to integrate this table into Snowball? (Or, of course, if
anyone is willing, just take the table and convert it into an sbl
stemmer yourself without me :) )


