[Snowball-discuss] Chinese and Japanese stemming algorithm

Miguel Florido miguel.florido at softonic.com
Wed Nov 2 11:16:06 GMT 2011


I think these emails will be really useful for us.

Currently we're using bi-grams, but we are looking to improve the results of our internal searches. I'll take a look at the project you mentioned, http://code.google.com/p/cjk-tokenizer/, and try to use it with Sphinx.
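For reference, the bi-gram approach amounts to indexing every pair of
adjacent characters. A minimal sketch of the idea in Python (purely
illustrative; Sphinx does the real indexing for us):

    def cjk_bigrams(text):
        """Yield overlapping character bi-grams from a run of text."""
        # Drop spaces; CJK scripts are written without them anyway.
        chars = [c for c in text if not c.isspace()]
        if len(chars) == 1:
            yield chars[0]
        for i in range(len(chars) - 1):
            yield chars[i] + chars[i + 1]

    print(list(cjk_bigrams("北京大学")))
    # -> ['北京', '京大', '大学']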

Thanks for your response and your interest in helping us.


Miguel Florido
Web Developer Junior
miguel.florido at softonic.com


http://www.softonic.com
Edificio Meridian C/ Rosselló i Porcel, 21, planta 17 - 08016 Barcelona (SPAIN)
Tel +34 936 012 700     Fax +34 933 969 292

Award-winning company: Great Place to Work 2011



-----Original Message-----
From: Martin Porter [mailto:martin.f.porter at gmail.com]
Sent: Wednesday, 2 November 2011 11:52
To: Miguel Florido
CC: Snowball-discuss at lists.tartarus.org
Subject: Re: Chinese and Japanese stemming algorithm

Miguel,

Chinese does not have inflectional endings, so the concept of stemming
is not relevant to that language. The big problem in Chinese is the
division of the text to be retrieved, and of queries, into "words", or
at least into units for indexing and retrieval. (Chinese is of course
written without spaces.) There is a large literature on this in IR,
but my own knowledge is now seriously out of date (*footnote). There
are systems for splitting Chinese text into words. These used to be
expensive proprietary pieces of software, and I always hoped that free
versions would eventually appear. Does anyone else contributing to
snowball-discuss have more knowledge of what is currently available?
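For illustration, one classical technique from that literature is
greedy "maximum matching" against a word list: at each position, take
the longest dictionary entry that matches, falling back to a single
character. A minimal sketch in Python (the lexicon here is a toy; a
real system needs a large dictionary):

    def max_match(text, lexicon, max_word_len=4):
        """Greedy forward maximum matching segmentation."""
        words = []
        i = 0
        while i < len(text):
            # Try the longest candidate first, down to one character.
            for j in range(min(len(text), i + max_word_len), i, -1):
                if text[i:j] in lexicon or j == i + 1:
                    words.append(text[i:j])
                    i = j
                    break
        return words

    # Toy lexicon for demonstration only.
    lexicon = {"北京", "大学", "北京大学"}
    print(max_match("我爱北京大学", lexicon))
    # -> ['我', '爱', '北京大学']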

Japanese is similar, but the problem is even worse. Since there are no
spaces, word splitting is required; but thereafter something like
stemming (or lemmatisation) may be applied. Again, there is a large
literature.
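For Japanese there is at least one free morphological analyser: MeCab
does both the word splitting and the base-form (lemma) lookup. A
minimal sketch, assuming its Python binding and a standard dictionary
are installed:

    import MeCab

    # "-Owakati" asks MeCab for plain space-separated segmentation.
    wakati = MeCab.Tagger("-Owakati")
    print(wakati.parse("すもももももももものうち").strip())
    # -> すもも も もも も もも の うち

    # The default output adds part of speech and the dictionary (base)
    # form of each token, which covers the lemmatisation step.
    tagger = MeCab.Tagger()
    print(tagger.parse("食べました"))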

Actually, no further stemmers are planned at snowball, although we
might get occasional contributions.

Martin

(*footnote) For example, Information Processing and Management, vol.
35 (1999) has a special section on Asian-language IR, pp. 421ff.

On Wed, Nov 2, 2011 at 9:56 AM, Miguel Florido
<miguel.florido at softonic.com> wrote:
>
> Dear Martin,
>
>
>
> We are using one of your projects, the stemming algorithm. Due to our international expansion, we need a stemming algorithm for Chinese and Japanese. As far as we can see, these two algorithms are not implemented yet, and http://snowball.tartarus.org/ shows that the last update was in July 2010. Do you know if they are planned to be implemented in the short or medium term?
>
> If not, could you tell us of another source where we can get these algorithms?
> Thanks and best regards.
> Miguel Florido
> Web Developer Junior
> miguel.florido at softonic.com
>
> http://www.softonic.com


