[Snowball-discuss] Chinese and Japanese stemming algorithm

Martin Porter martin.f.porter at gmail.com
Wed Nov 2 10:51:37 GMT 2011


Miguel,

Chinese does not have inflectional endings, so the concept of stemming
is not relevant to that language. The big problem in Chinese is
dividing the text to be retrieved, and the queries, into "words", or
at least into units for indexing and retrieval. (Chinese is of course
written without spaces.) There is a large literature on this in IR,
but my own knowledge is now seriously out of date (*footnote). There
are systems for splitting Chinese text into words. These used to be
expensive proprietary pieces of software, and I always hoped that free
versions would eventually appear. Does anyone else contributing to
snowball-discuss have more knowledge of what is currently available?
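(To illustrate the segmentation problem: one of the classic textbook approaches is greedy dictionary-based "maximum matching", which takes the longest dictionary word at each position. The sketch below is a minimal illustration only; the tiny dictionary and example string are hypothetical, and real segmenters are far more sophisticated.)

```python
def max_match(text, dictionary, max_word_len=4):
    """Greedy maximum matching: at each position, take the longest
    substring (up to max_word_len characters) found in the dictionary,
    falling back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Illustrative toy dictionary: 北京 "Beijing", 大学 "university",
# 北京大学 "Peking University", 学生 "student", 生 "life/raw".
dictionary = {"北京", "大学", "北京大学", "学生", "生"}
print(max_match("北京大学生", dictionary))
# Greedy matching yields 北京大学 + 生, even though 北京 + 大学生
# ("Beijing university student") may be the intended reading --
# exactly the kind of ambiguity that makes the problem hard.
```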

Japanese is similar, but the problem is even worse. Since there are no
spaces, word splitting is required; but thereafter something like
stemming (or lemmatisation) may be applied. Again, there is a large
literature.

Actually, no further stemmers are planned at Snowball, although we
might get occasional contributions.

Martin

(*footnote) For example, Information Processing and Management, vol.
35 (1999), has a special section on Asian-language IR, pp. 421 onwards.

On Wed, Nov 2, 2011 at 9:56 AM, Miguel Florido
<miguel.florido at softonic.com> wrote:
>
> Dear Martin
>
>
>
> We are using one of your projects, the stemming algorithm. Due to our international expansion, we need a stemming algorithm for Chinese and Japanese. As far as we can see, these two algorithms are not implemented yet, and http://snowball.tartarus.org/ shows that the last update was in July 2010. Do you know if they are planned to be implemented in the short to medium term?
>
> If not, could you tell us another source where we can get these algorithms?
> Thanks and best regards.
> Miguel Florido
> Web Developer Junior
> miguel.florido at softonic.com
>
> http://www.softonic.com
