[Snowball-discuss] Chinese and Japanese stemming algorithm

Dawid Weiss dawid.weiss at cs.put.poznan.pl
Wed Nov 2 11:09:30 GMT 2011


There is a CJK segmentation engine inside the Apache Lucene project
(imported from somewhere else, but with improvements). There is also a
Chinese segmentation tutorial at LingPipe's website here:

http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html

Dawid

On Wed, Nov 2, 2011 at 12:03 PM, Richard Boulton <richard at tartarus.org> wrote:
> On 2 November 2011 10:51, Martin Porter <martin.f.porter at gmail.com> wrote:
>> Does anyone else contributing to snowball
>> discuss have more knowledge on what is currently available?
>
> There are a few things available.
>
> One easy approach is to use bi-grams of CJK characters, which doesn't
> work wonderfully, but is better than nothing (a minimal sketch of the
> idea follows below the quoted text).  There are some bits of
> code lying around to assist with that; for example,
> http://code.google.com/p/cjk-tokenizer/
>
> There are also more sophisticated approaches, generally involving some
> use of dictionaries (a toy sketch of that idea also follows below).  I
> don't know of standalone code for doing these, but we had a Google
> Summer of Code student with Xapian this year who did quite a lot of
> work on Chinese word segmentation; his work hasn't been integrated
> into Xapian core yet, but the trac page describing it (with links to
> the code) is
> http://trac.xapian.org/wiki/GSoC2011/ChineseSegmentationAnalysis
>
> Olly Betts was his primary mentor, so may be able to give more detail.
>
> --
> Richard
>
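
To illustrate the bi-gram approach Richard mentions, here is a minimal
sketch in Python; the function name is made up for illustration and is
not taken from cjk-tokenizer or Lucene:

    def cjk_bigrams(text):
        """Yield overlapping character bigrams: ABCD -> AB, BC, CD."""
        for i in range(len(text) - 1):
            yield text[i:i + 2]

    print(list(cjk_bigrams("中文分词")))  # ['中文', '文分', '分词']

Each adjacent pair of characters becomes an indexing unit, so a query
matches wherever its bigrams co-occur; this sidesteps the word
segmentation problem entirely, at the cost of some false matches across
word boundaries.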
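For the dictionary-based approaches, a common baseline is greedy
forward maximum matching.  The sketch below is illustrative only: the
toy dictionary and the names are made up and unrelated to the GSoC
code linked above.

    # Toy word list; a real segmenter would load a large dictionary.
    DICTIONARY = {"中文", "分词"}
    MAX_WORD_LEN = 4

    def forward_max_match(text):
        """Take the longest dictionary word at each position, falling
        back to a single character when nothing matches."""
        words, i = [], 0
        while i < len(text):
            for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if length == 1 or candidate in DICTIONARY:
                    words.append(candidate)
                    i += length
                    break
        return words

    print(forward_max_match("中文分词"))  # ['中文', '分词']

Greedy matching is only a baseline; the more serious systems score
alternative segmentations statistically rather than always taking the
longest match.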


