[Xapian-discuss] Chinese, Japanese, Korean Tokenizer.

Kevin Duraj kevin.softdev at gmail.com
Wed Jun 6 02:46:29 BST 2007


The site says that CJKTokenizer handles double-byte characters
differently from other tokenizers: it returns a token for every two
characters, with overlap.
Example: "java C1C2C3C4" will be segmented into: "java" "C1C2" "C2C3"
"C3C4". Zero-length tokens ("") also need to be filtered out.

Perhaps we could segment double-byte text into terms similarly to what
the Java code does. It is not perfect, but at least we could start to
index and search Asian "characters" ...
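
To make the overlapping scheme concrete, here is a rough Python sketch.
It is not the Lucene code: the function names and the code point ranges
are my own assumptions, the ranges are far from exhaustive, and
lowercasing the non-CJK words and keeping a lone CJK character as its
own token are choices I made for the illustration.

```python
from itertools import groupby


def is_cjk(ch):
    """Rough CJK test by code point range (assumed ranges, not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF       # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF    # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF    # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF    # Hangul syllables
    )


def char_class(ch):
    """Classify a character as CJK, ordinary word character, or separator."""
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def cjk_bigram_tokens(text):
    """Split text into plain words plus overlapping two-character CJK tokens."""
    tokens = []
    for kind, group in groupby(text, key=char_class):
        run = "".join(group)
        if kind == "word":
            tokens.append(run.lower())
        elif kind == "cjk":
            if len(run) == 1:
                # A lone CJK character is kept as a single-character token.
                tokens.append(run)
            else:
                # Overlapping pairs: C1C2, C2C3, C3C4, ...
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        # Separator runs are dropped, so no zero-length tokens are produced.
    return tokens


print(cjk_bigram_tokens("java 北京大学"))
```

The same function would have to be applied to both the documents and
the query text, so that the overlapping pairs line up at search time.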

Cheers
  -Kevin


On 6/5/07, Olly Betts <olly at survex.com> wrote:
> On Tue, Jun 05, 2007 at 02:37:27PM -0700, Kevin Duraj wrote:
> > I am looking for a Chinese, Japanese and Korean tokenizer that can
> > be used to tokenize terms for CJK languages. I am not very familiar
> > with these languages, however I think that they can contain one
> > or more words in one symbol, which makes it more difficult to tokenize
> > into searchable terms.
>
> I've not investigated Japanese much or Korean at all, but I know a
> little about Chinese.
>
> Chinese "characters" are themselves words, but many words are formed
> from multiple characters.  For example, the Chinese capital Beijing is
> formed from two characters (which literally mean something like "North
> Capital").
>
> The difficulty is that Chinese text is usually written without any
> indication of how the symbols group, so you need an algorithm to
> identify them if you want to index such groups as terms.  I understand
> that's quite a hard problem.
>
> However, perhaps you don't need to do that.  You could just index each
> symbol as a word and use phrase searching, or something like it.
>
> > Lucene has CJK Tokenizer ... and I am looking around if there is some
> > open source that we could use with Xapian.
> >
> > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/analysis/cjk/package-summary.html
>
> That doesn't provide much information, but if you can find the source
> code, you could analyse the algorithm used and if it's any good
> implement it for use with Xapian.
>
> Cheers,
>     Olly
>
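
Olly's other suggestion, indexing each symbol as its own word and
relying on phrase searching, can be illustrated with a small
positional index. This is plain Python with made-up names
(index_symbols, phrase_search), not the Xapian API, and it assumes
that whitespace is skipped and every other character counts as one
symbol.

```python
from collections import defaultdict


def index_symbols(docs):
    """Map each non-space character to {doc_id: [positions]}."""
    postings = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        pos = 0
        for ch in text:
            if ch.isspace():
                continue
            postings[ch][doc_id].append(pos)
            pos += 1
    return postings


def phrase_search(postings, query):
    """Return sorted ids of documents containing the query symbols adjacently."""
    symbols = [ch for ch in query if not ch.isspace()]
    if not symbols:
        return []
    hits = []
    first = postings.get(symbols[0], {})
    for doc_id, starts in first.items():
        for start in starts:
            # Symbol k of the query must sit at position start + k.
            if all(
                start + k in postings.get(sym, {}).get(doc_id, ())
                for k, sym in enumerate(symbols)
            ):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {1: "我在北京大学", 2: "京北是地名", 3: "东京"}
postings = index_symbols(docs)
print(phrase_search(postings, "北京"))
```

Document 2 contains both symbols but in the other order, so only the
positional check keeps it out of the results for "北京".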


-- 
Kevin Duraj
http://myhealthcare.com


