[Snowball-discuss] Chinese and Japanese stemming algorithm

Richard Boulton richard at tartarus.org
Wed Nov 2 11:03:18 GMT 2011


On 2 November 2011 10:51, Martin Porter <martin.f.porter at gmail.com> wrote:
> Does anyone else contributing to snowball
> discuss have more knowledge on what is currently available?

There are a few things available.

One easy approach is to use bi-grams of CJK characters, which doesn't
work wonderfully, but is better than nothing.  There are some bits of
code lying around to assist with that, for example
http://code.google.com/p/cjk-tokenizer/
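
To make the bi-gram idea concrete, here is a minimal sketch in Python.
It is not taken from cjk-tokenizer; the function names and the
(deliberately incomplete) set of Unicode ranges are my own assumptions,
chosen just to illustrate the technique: each run of CJK characters is
indexed as overlapping pairs, and everything else is split into
ordinary words.

```python
def is_cjk(ch):
    """Rough CJK test (assumption: only these three blocks are covered)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x309F   # Hiragana
            or 0x30A0 <= cp <= 0x30FF)  # Katakana


def cjk_bigrams(text):
    """Return overlapping bi-grams for CJK runs, plain words otherwise."""
    terms = []
    run = ""   # current run of CJK characters
    word = ""  # current run of non-CJK letters/digits

    def flush_run():
        nonlocal run
        if len(run) == 1:
            # A lone CJK character has no pair, so index it by itself.
            terms.append(run)
        for i in range(len(run) - 1):
            terms.append(run[i:i + 2])
        run = ""

    def flush_word():
        nonlocal word
        if word:
            terms.append(word.lower())
        word = ""

    for ch in text:
        if is_cjk(ch):
            flush_word()
            run += ch
        elif ch.isalnum():
            flush_run()
            word += ch
        else:
            # Whitespace or punctuation ends whatever we were collecting.
            flush_run()
            flush_word()
    flush_run()
    flush_word()
    return terms
```

The same function would be applied to both documents and queries, so
that a query for a two-character word matches any text containing that
pair.  The cost is false matches where a pair straddles a real word
boundary, which is a large part of why this doesn't work wonderfully.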

There are also more sophisticated approaches, generally involving some
use of dictionaries.  I don't know of standalone code for doing these,
but we had a Google Summer of Code student with Xapian this year who
implemented quite a lot of Chinese word segmentation code; his work
hasn't been integrated into Xapian core yet, but the Trac page
describing it (with links to the code) is
http://trac.xapian.org/wiki/GSoC2011/ChineseSegmentationAnalysis

Olly Betts was his primary mentor, so may be able to give more detail.
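
For a rough idea of what a dictionary-based approach looks like, here
is a sketch of forward maximum matching, one of the simplest such
methods.  This is not the GSoC student's code and I'm not claiming it
is the algorithm he used; the function name and the toy dictionary in
the usage note are made up for illustration.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy forward maximum matching against a set of known words.

    At each position, take the longest dictionary word starting there;
    if nothing matches, emit a single character and move on.
    """
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

With the toy dictionary {"研究", "研究生", "生命", "起源"}, calling
max_match("研究生命起源", ...) greedily takes "研究生" first and so
misses the intended split 研究 / 生命 / 起源.  That kind of ambiguity
is why real segmenters tend to add word frequencies or statistical
models on top of the dictionary.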

-- 
Richard
