[Snowball-discuss] -ize and -ise, -ization and -isation

Milan Bouchet-Valat nalimilan at club-internet.fr
Wed Jul 10 10:04:45 BST 2013


Hi!

I am the author of an R package providing a GUI to perform text mining operations [1], and as part of this project I also created a package to allow using libstemmer from R [2].

I'm wondering whether it is intended that both the original Porter stemmer and the newer English stemmer treat US forms ending in -ize or -ization differently from GB forms ending in -ise or -isation. The algorithms include rules to replace -izer and -ization with -ize and -alize with -al, and eventually to delete -ize. No such rules exist for the GB forms. This problem arose when analysing (ah, GB form!) newswires from different agencies, where different forms were used. It would seem logical to reduce e.g. organization and organisation to the same stem, organ (while currently the latter gives organis).
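
To make the idea concrete, one conceivable workaround outside the stemmer would be to rewrite GB endings to their US forms before stemming, so that the existing -ize rules apply to both. The following is only a minimal Python sketch under stated assumptions: the names normalise_gb_suffix, GB_TO_US and MIN_STEM are made up for illustration, the length threshold is an arbitrary guess, and none of this is part of Snowball or libstemmer.

```python
# Hypothetical pre-processing step; NOT part of Snowball or libstemmer.
# Suffix pairs are checked longest first so "-isation" wins over "-ise".
GB_TO_US = [
    ("isation", "ization"),
    ("iser", "izer"),
    ("ise", "ize"),
]

# Assumed threshold: at least this many letters must precede the suffix,
# which leaves short words such as "rise" or "wise" untouched.
MIN_STEM = 3


def normalise_gb_suffix(word):
    """Lowercase a word and rewrite a GB -ise/-iser/-isation ending to US spelling."""
    lower = word.lower()
    for gb, us in GB_TO_US:
        if lower.endswith(gb) and len(lower) - len(gb) >= MIN_STEM:
            return lower[: len(lower) - len(gb)] + us
    return lower


if __name__ == "__main__":
    for w in ["organisation", "organization", "organise", "rise", "promise"]:
        print(w, "->", normalise_gb_suffix(w))
```

Such a purely suffix-based rewrite is too greedy: it also turns "promise" into "promize", so a real solution would need an exception list or, preferably, equivalent rules inside the stemming algorithm itself.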


Thanks for your work!

1: http://cran.r-project.org/web/packages/RcmdrPlugin.temis/
2: http://cran.r-project.org/web/packages/SnowballC/


