[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

☼ 林永忠 ☼ (Yung-chung Lin, a.k.a. "Kaspar" or "xern") henearkrxern at gmail.com
Wed Jun 6 03:00:27 BST 2007


Accurate Chinese tokenization is difficult. In general, bigram
tokenization works better than unigram tokenization for Chinese and
can meet average needs. There is no need to use tokens longer than
trigrams, which would only result in worse performance.
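For example, here is a minimal sketch of overlapping-bigram
tokenization in Python (the names are my own, not part of Xapian or
Lucene, and a real tokenizer would cover more Unicode ranges than
the CJK Unified Ideographs block):

    import re

    # One alternative per token type: a run of CJK ideographs, or a
    # run of ASCII letters/digits.
    TOKEN_RE = re.compile(u'([\u4e00-\u9fff]+)|([A-Za-z0-9]+)')

    def cjk_bigram_tokens(text):
        tokens = []
        for cjk_run, word in TOKEN_RE.findall(text):
            if word:
                tokens.append(word.lower())
            elif len(cjk_run) == 1:
                tokens.append(cjk_run)  # a lone ideograph stays a unigram
            else:
                # Overlapping bigrams: C1C2, C2C3, C3C4, ...
                tokens.extend(cjk_run[i:i + 2]
                              for i in range(len(cjk_run) - 1))
        return tokens

This is exactly the "java C1C2C3C4" -> "java" "C1C2" "C2C3" "C3C4"
scheme quoted below, with zero-length tokens never produced in the
first place.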
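Olly's suggestion below (index each symbol or n-gram as its own term
and rely on positional matching) maps directly onto the Xapian API.
A sketch using the Python bindings, assuming a tokenizer like the one
above (the database path and sample text are just placeholders):

    import xapian

    def index_doc(db, text):
        # Store each token with its position so that phrase and
        # NEAR queries work across the bigrams.
        doc = xapian.Document()
        doc.set_data(text)
        for pos, term in enumerate(cjk_bigram_tokens(text)):
            doc.add_posting(term, pos + 1)  # positions start at 1
        db.add_document(doc)

    db = xapian.WritableDatabase('cjk.db', xapian.DB_CREATE_OR_OPEN)
    index_doc(db, u'java \u5317\u4eac\u5927\u5b66')

The same tokenizer then has to be applied to the user's query string
before building a xapian.Query, so that document and query terms
agree.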

Best,
Yung-chung Lin

On 6/6/07, Kevin Duraj <kevin.softdev at gmail.com> wrote:
> The site says: CJKTokenizer performs a different tokenizing method
> for double-byte characters: it returns a token for each two
> characters, with overlapping matches.
> Example: "java C1C2C3C4" will be segmented into: "java" "C1C2"
> "C2C3" "C3C4". It also needs to filter out zero-length tokens ("").
>
> Perhaps we could segment double-byte characters into terms similarly
> to what the Java tokenizer does. It is not perfect, but at least we
> could start to index and search Asian "characters" ...
>
> Cheers
>   -Kevin
>
>
> On 6/5/07, Olly Betts <olly at survex.com> wrote:
> > On Tue, Jun 05, 2007 at 02:37:27PM -0700, Kevin Duraj wrote:
> > > I am looking for a Chinese, Japanese and Korean tokenizer that
> > > could be used to tokenize terms for CJK languages. I am not very
> > > familiar with these languages, but I understand that they can put
> > > one or more words in a single symbol, which makes it more
> > > difficult to tokenize text into searchable terms.
> >
> > I've not investigated Japanese much or Korean at all, but I know a
> > little about Chinese.
> >
> > Chinese "characters" are themselves words, but many words are formed
> > from multiple characters.  For example, the Chinese capital Beijing is
> > formed from two characters (which literally mean something like "North
> > Capital").
> >
> > The difficulty is that Chinese text is usually written without any
> > indication of how the symbols group, so you need an algorithm to
> > identify them if you want to index such groups as terms.  I understand
> > that's quite a hard problem.
> >
> > However, perhaps you don't need to do that.  You could just index each
> > symbol as a word and use phrase searching, or something like it.
> >
> > > Lucene has a CJK Tokenizer ... and I am looking around to see if
> > > there is some open source code that we could use with Xapian.
> > >
> > > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html
> >
> > That doesn't provide much information, but if you can find the source
> > code, you could analyse the algorithm used and if it's any good
> > implement it for use with Xapian.
> >
> > Cheers,
> >     Olly
> >
>
>
> --
> Kevin Duraj
> http://myhealthcare.com
>
> _______________________________________________
> Xapian-discuss mailing list
> Xapian-discuss at lists.xapian.org
> http://lists.xapian.org/mailman/listinfo/xapian-discuss
>


