[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

Wed Jun 13 04:10:55 BST 2007

Here is a simple, easy-to-implement tokenizer with some intelligence:
it uses maximum matching against a word list built from training data.
It can extract tokens of up to 5 Chinese characters. You may want to
port the code to C/C++ to make it faster.
The training data is available at
http://www.sighan.org/bakeoff2003/as_training.zip.
The data is encoded in BIG5.

#!/usr/bin/perl

use strict;
# 'use encoding' is deprecated; set the output layer explicitly instead.
binmode STDOUT, ':encoding(UTF-8)';
use Data::Dumper;
use Encode;
use List::Util qw(min);

my $tok_info;

sub load_token_info {
    open my $fh, '<', 'TrainingSinicaCorpus.txt' or die $!;
    while (my $line = <$fh>) {
        chomp $line;
        $line = decode('big5', $line);
        for my $token (split /\s+/, $line) {
            $tok_info->{$token} = 1;
        }
    }
    close $fh;

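    # Cache the word list as a Perl hash literal so later runs start faster.
    # A new variable name avoids redeclaring $fh in the same scope.
    open my $out, '>', 'tok_info.txt' or die $!;
    $Data::Dumper::Terse = 1;
    $Data::Dumper::Indent = 0;
    print {$out} Dumper $tok_info;
    close $out;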
}

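sub tokenize {
    my $string = shift;
    $string = decode("utf8", $string);
    my $i = 0;
    while ($i < length $string) {
        my $matched = 0;
        # Try the longest candidate first, never reading past the end.
        for my $l (reverse 1..min(length($string) - $i, 5)) {
            if ($tok_info->{substr($string, $i, $l)}) {
                print substr($string, $i, $l), "\n";
                $i += $l;
                $matched = 1;
                last;
            }
        }
        # Unknown character: emit it on its own so the loop always advances.
        if (!$matched) {
            print substr($string, $i, 1), "\n";
            $i++;
        }
    }
}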

sub main {
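    my $string = shift @ARGV;
    die "usage: $0 TEXT\n" unless defined $string;
    if (!-e 'tok_info.txt') {
        load_token_info();
    }
    else {
        # Use an explicit ./ so that do does not depend on @INC containing '.'.
        $tok_info = do './tok_info.txt'
            or die "cannot load tok_info.txt: ", ($@ || $!), "\n";
    }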
    tokenize($string);
}

main;
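
For comparison, the overlapping two-character scheme described in the
quoted message below needs no training data. What follows is only a
minimal sketch of that idea, not Lucene's actual CJKTokenizer code. The
name bigram_tokens is made up for illustration, and the sketch assumes
that \p{Han} is a good enough test for a "CJK character" (it ignores
kana and hangul).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Lucene or Xapian).
# Takes an already-decoded string. Runs of Han characters become
# overlapping two-character tokens; any other run of non-space
# characters is kept whole.
sub bigram_tokens {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /(\p{Han}+)|([^\s\p{Han}]+)/g) {
        if (defined $1) {
            my $run = $1;
            if (length($run) == 1) {
                # A lone character is its own token.
                push @tokens, $run;
            }
            else {
                # Slide a window of width 2 over the run.
                push @tokens, substr($run, $_, 2) for 0 .. length($run) - 2;
            }
        }
        else {
            push @tokens, $2;
        }
    }
    return @tokens;
}

# Command-line use: pass the text as a UTF-8 encoded argument.
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for bigram_tokens(decode('UTF-8', $ARGV[0]));
}
```

Given "java" followed by four Han characters, this prints "java" and
then three overlapping pairs, one token per line. Because whitespace is
skipped by the pattern, no zero-length tokens are produced.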

On 6/6/07, Kevin Duraj <kevin.softdev at gmail.com> wrote:
> The site says that CJKTokenizer handles double-byte characters
> differently: it returns a token for every two characters, with
> overlapping matches.
> Example: "java C1C2C3C4" will be segmented into: "java" "C1C2" "C2C3"
> "C3C4". It also needs to filter out zero-length tokens ("").
>
> Perhaps we could segment double-byte text into terms similarly to what
> the Java code does. It is not perfect, but at least we could start to
> index and search Asian "characters" ...
>
> Cheers
>   -Kevin
>
>
> On 6/5/07, Olly Betts <olly at survex.com> wrote:
> > On Tue, Jun 05, 2007 at 02:37:27PM -0700, Kevin Duraj wrote:
> > > I am looking for a Chinese, Japanese and Korean tokenizer that can
> > > be used to tokenize terms for CJK languages. I am not very familiar
> > > with these languages; however, I think that they can contain one
> > > or more words in one symbol, which makes it more difficult to
> > > tokenize text into searchable terms.
> >
> > I've not investigated Japanese much or Korean at all, but I know a
> > little about Chinese.
> >
> > Chinese "characters" are themselves words, but many words are formed
> > from multiple characters.  For example, the Chinese capital Beijing is
> > formed from two characters (which literally mean something like "North
> > Capital").
> >
> > The difficulty is that Chinese text is usually written without any
> > indication of how the symbols group, so you need an algorithm to
> > identify them if you want to index such groups as terms.  I understand
> > that's quite a hard problem.
> >
> > However, perhaps you don't need to do that.  You could just index each
> > symbol as a word and use phrase searching, or something like it.
> >
> > > Lucene has CJK Tokenizer ... and I am looking around if there is some
> > > open source that we could use with Xapian.
> > >
> > > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html
> >
> > That doesn't provide much information, but if you can find the source
> > code, you could analyse the algorithm used and if it's any good
> > implement it for use with Xapian.
> >
> > Cheers,
> >     Olly
> >
>
>
> --
> Kevin Duraj
> http://myhealthcare.com