[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. "Kaspar" or "xern") henearkrxern at gmail.com
Fri Jun 29 03:15:53 BST 2007

A ready-to-use bigram CJK tokenizer is attached to this mail. Enjoy it. Thanks.

Yung-chung Lin

On 6/6/07, Kevin Duraj <kevin.softdev at gmail.com> wrote:
> Hi,
> I am looking for a Chinese, Japanese, and Korean tokenizer that can
> be used to tokenize terms for CJK languages. I am not very familiar
> with these languages, but I understand that a single symbol can
> represent one or more words, which makes it harder to tokenize
> text into searchable terms.
> Lucene has a CJK Tokenizer ... and I am looking around to see if there
> is some open source that we could use with Xapian.
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html
> Cheers
>   -Kevin Duraj
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
-------------- next part --------------
#ifndef __TOKENIZER_H__
#define __TOKENIZER_H__

#include <cstddef>
#include <string>
#include <vector>
#include <unicode.h>

namespace cjk {
    /* The attachment appears truncated in the archive: the enum body,
     * the access specifier, and the closing braces below are
     * reconstructed so the header compiles, and the enumerator names
     * are assumptions (the covering mail only says "bigram"). */
    enum tokenizer_type {
        UNIGRAM,
        BIGRAM
    };

    class tokenizer {
        enum tokenizer_type _type;
        inline void _convert_unicode_to_char(unicode_char_t &uchar,
                                             unsigned char *p);
    public:
        tokenizer(enum tokenizer_type type);
        void tokenize(std::string &str,
                      std::vector<std::string> &token_list);
        void tokenize(char *buf, size_t buf_len,
                      std::vector<std::string> &token_list);
        void split(std::string &str,
                   std::vector<std::string> &token_list);
        void split(char *buf, size_t buf_len,
                   std::vector<std::string> &token_list);
    };
}

#endif /* __TOKENIZER_H__ */

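For readers unfamiliar with the technique: a bigram tokenizer emits every pair of adjacent characters as a term, so multi-character CJK words can be matched without a segmentation dictionary. Here is a minimal, self-contained sketch of that idea (my own illustration, not the attached implementation; the helper names `utf8_char_len` and `bigram_tokenize` are invented for this example, and valid UTF-8 input is assumed):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte length of a UTF-8 sequence, judged from its lead byte.
// Assumes the input is valid UTF-8 (no lone continuation bytes).
static size_t utf8_char_len(unsigned char lead) {
    if (lead < 0x80) return 1;          // ASCII
    if ((lead >> 5) == 0x6) return 2;   // 110xxxxx
    if ((lead >> 4) == 0xE) return 3;   // 1110xxxx
    return 4;                           // 11110xxx
}

// Split a UTF-8 string into characters, then emit each pair of
// adjacent characters as one token (overlapping bigrams).
static void bigram_tokenize(const std::string &str,
                            std::vector<std::string> &token_list) {
    std::vector<std::string> chars;
    for (size_t i = 0; i < str.size(); ) {
        size_t len = utf8_char_len((unsigned char)str[i]);
        if (i + len > str.size()) break;  // truncated sequence: stop
        chars.push_back(str.substr(i, len));
        i += len;
    }
    for (size_t i = 0; i + 1 < chars.size(); ++i)
        token_list.push_back(chars[i] + chars[i + 1]);
}
```

Given "中文分词" this yields the tokens 中文, 文分, 分词; indexing documents and queries with the same bigrams then allows multi-character words to match without any word segmentation.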