[Snowball-discuss] Chinese and Japanese stemming algorithm

Olly Betts olly at survex.com
Thu Nov 3 10:30:41 GMT 2011


On Wed, Nov 02, 2011 at 11:03:18AM +0000, Richard Boulton wrote:
> There are also more sophisticated approaches, generally involving some
> use of dictionaries.  I don't know of standalone code for doing these,
> but we had a Google-Summer-of-Code student with Xapian this year who
> implemented quite a lot of stuff for Chinese word segmentation; his
> work hasn't been integrated into Xapian core yet, but the trac page
> describing it (with links to the code) is
> http://trac.xapian.org/wiki/GSoC2011/ChineseSegmentationAnalysis

That work is currently pretty much standalone (I think it uses Xapian's
UTF-8 support, but that wouldn't be hard to replace if you wanted to
use it in another context).

There's also scws, which is standalone:

http://www.ftphp.com/scws/

I don't know a whole lot about it - I only know of it because there's a
patch for Xapian integration from its author:

http://article.gmane.org/gmane.comp.search.xapian.general/9052

> Olly Betts was his primary mentor, so may be able to give more detail.

I can certainly try to answer questions.

Cheers,
    Olly


