[Snowball-discuss] Chinese and Japanese stemming algorithm
Richard Boulton
richard at tartarus.org
Wed Nov 2 11:03:18 GMT 2011
On 2 November 2011 10:51, Martin Porter <martin.f.porter at gmail.com> wrote:
> Does anyone else contributing to snowball
> discuss have more knowledge on what is currently available?
There are a few things available.
One easy approach is to use bi-grams of CJK characters, which doesn't
work wonderfully, but is better than nothing. There are some bits of
code lying around to assist with that; for example,
http://code.google.com/p/cjk-tokenizer/
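The bi-gram idea is simple enough to sketch in a few lines. This is a minimal illustration of the technique, not the cjk-tokenizer code itself; the function name and the restriction to the CJK Unified Ideographs block are my own assumptions for the sketch:

```python
def cjk_bigrams(text):
    """Emit overlapping pairs of adjacent CJK characters.

    Hypothetical helper, not part of cjk-tokenizer. Only the CJK
    Unified Ideographs block (U+4E00..U+9FFF) is treated as CJK here;
    a real tokenizer would cover more blocks (kana, extensions, etc.).
    """
    def is_cjk(ch):
        return '\u4e00' <= ch <= '\u9fff'

    tokens = []
    run = []  # current unbroken run of CJK characters
    for ch in text:
        if is_cjk(ch):
            run.append(ch)
        else:
            run = []  # non-CJK character breaks the run
        if len(run) >= 2:
            # each new character pairs with its predecessor
            tokens.append(run[-2] + run[-1])
    return tokens

print(cjk_bigrams("我爱北京"))  # ['我爱', '爱北', '北京']
```

Because every adjacent character pair becomes a token, any true word of two or more characters in a query is guaranteed to share bi-grams with documents containing it, at the cost of many spurious matches.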
There are also more sophisticated approaches, generally involving some
use of dictionaries. I don't know of standalone code for doing these,
but we had a Google-Summer-of-Code student with Xapian this year who
implemented quite a lot of stuff for Chinese word segmentation; his
work hasn't been integrated into Xapian core yet, but the trac page
describing it (with links to the code) is
http://trac.xapian.org/wiki/GSoC2011/ChineseSegmentationAnalysis
Olly Betts was his primary mentor, so may be able to give more detail.
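For a flavour of what "dictionary-based segmentation" means, here is a sketch of forward maximum matching, one of the classic baseline algorithms in this area. This is not the GSoC student's code (his work is linked from the trac page above); the function name, the toy lexicon, and the max word length are assumptions made for illustration:

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching (illustrative sketch only).

    At each position, take the longest substring (up to max_len
    characters) found in the dictionary; fall back to a single
    character when nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + n]
            # single characters are always accepted as a fallback
            if n == 1 or cand in dictionary:
                words.append(cand)
                i += n
                break
    return words

lexicon = {"北京", "大学", "北京大学"}  # toy dictionary (assumption)
print(max_match("北京大学", lexicon))  # ['北京大学'] - longest match wins
```

Real segmenters layer statistics (word frequencies, character-based models) on top of this kind of dictionary lookup to resolve ambiguous boundaries, which is where most of the GSoC work went.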
--
Richard