[Snowball-discuss] Chinese and Japanese stemming algorithm
Olly Betts
olly at survex.com
Thu Nov 3 10:30:41 GMT 2011
On Wed, Nov 02, 2011 at 11:03:18AM +0000, Richard Boulton wrote:
> There are also more sophisticated approaches, generally involving some
> use of dictionaries. I don't know of standalone code for doing these,
> but we had a Google-Summer-of-Code student with Xapian this year who
> implemented quite a lot of stuff for Chinese word segmentation; his
> work hasn't been integrated into Xapian core yet, but the trac page
> describing it (with links to the code) is
> http://trac.xapian.org/wiki/GSoC2011/ChineseSegmentationAnalysis
That work is currently pretty much standalone (I think it uses Xapian's
UTF-8 support, but that wouldn't be hard to replace if you wanted to
use it in another context).
There's also scws, which is standalone:
http://www.ftphp.com/scws/
I don't know a whole lot about it - I only know of it because there's a
patch for Xapian integration from its author:
http://article.gmane.org/gmane.comp.search.xapian.general/9052
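For anyone unfamiliar with the dictionary-based idea Richard mentions, here
is a minimal sketch of one common variant, greedy forward maximum matching.
It is only an illustration: it is not the GSoC code and not how scws works
internally, and the function name, the max_len parameter and the toy
dictionary are all made up for the example. A real segmenter would use a
large word list and usually something smarter than greedy matching.

```python
def segment(text, dictionary, max_len=4):
    """Split text into words by greedy forward maximum matching.

    At each position, take the longest dictionary word that starts there
    (up to max_len characters); if nothing matches, emit one character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, purely for illustration.
TOY_DICTIONARY = {"中文", "分词", "算法", "中文分词"}

print(segment("中文分词算法", TOY_DICTIONARY))
```

Python strings are already sequences of Unicode code points, so no separate
UTF-8 decoding step is needed here; in C++ that is the part you would need
a UTF-8 iterator for.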
> Olly Betts was his primary mentor, so may be able to give more detail.
I can certainly try to answer questions.
Cheers,
Olly