[Snowball-discuss] Chinese and Japanese stemming algorithm

Richard Boulton richard at tartarus.org
Wed Nov 2 11:03:18 GMT 2011

On 2 November 2011 10:51, Martin Porter <martin.f.porter at gmail.com> wrote:
> Does anyone else contributing to snowball
> discuss have more knowledge on what is currently available?

There are a few things available.

One easy approach is to use bi-grams of CJK characters, which doesn't
work wonderfully, but is better than nothing.  There are some bits of
code lying around to assist with that; for example,
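
To illustrate the bi-gram idea itself, here is a minimal Python sketch.
It is not any of the code referred to above; the function names
(is_cjk, cjk_bigrams), the choice of Unicode ranges, and the handling
of single-character runs are all assumptions made for illustration.

```python
def is_cjk(ch):
    """Rough test: True for characters in a few common CJK Unicode blocks.

    The set of ranges is an assumption for this sketch and is not exhaustive.
    """
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables
    )


def cjk_bigrams(text):
    """Return overlapping bigrams for each run of CJK characters in text.

    A run of length 1 is emitted as a single-character term.
    Non-CJK characters are only used as run separators here.
    """
    terms = []
    run = []

    def flush():
        # Emit the terms for the current run, then reset it.
        if len(run) == 1:
            terms.append(run[0])
        else:
            for i in range(len(run) - 1):
                terms.append(run[i] + run[i + 1])
        run.clear()

    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        else:
            flush()
    flush()
    return terms


print(cjk_bigrams("中文分词"))
```

Indexing and querying with the same bi-grams means no dictionary is
needed, at the cost of extra terms and some false matches across word
boundaries.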

There are also more sophisticated approaches, generally involving some
use of dictionaries.  I don't know of standalone code for doing these,
but we had a Google-Summer-of-Code student with Xapian this year who
implemented quite a lot of stuff for Chinese word segmentation; his
work hasn't been integrated into Xapian core yet, but the trac page
describing it (with links to the code) is

Olly Betts was his primary mentor, so may be able to give more detail.
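
As a rough idea of what a dictionary-based approach can look like, here
is a sketch of forward maximum matching, one of the simplest such
techniques.  It is not necessarily what the student's code does; the
function name, the max_word_len parameter, and the toy dictionary are
assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation using a set of known words.

    At each position, take the longest dictionary word that starts there;
    if none matches, fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        for length in range(longest, 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary, purely hypothetical.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", toy_dictionary))
```

With this toy dictionary the greedy match takes "研究生" first and leaves
"命" stranded, rather than producing "研究", "生命", "起源".  That kind of
mistake is why real segmenters tend to add word frequencies or
statistical models on top of the dictionary.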

