[Snowball-discuss] Chinese and Japanese stemming algorithm
Dawid Weiss
dawid.weiss at cs.put.poznan.pl
Wed Nov 2 11:50:28 GMT 2011
Just for completeness, here is another one Ivan Provalov mentioned at
Apache Lucene EuroCon -- it's said to be really, really good:
http://code.google.com/p/paoding/
I also know that Google Chrome (the Chromium project) has an embedded
segmentation and language-identification engine for multiple languages
(used for word highlighting etc.). You may want to check that out.
Dawid
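For reference, the overlapping character bi-gram approach Miguel mentions below can be sketched as follows. This is a minimal illustration, not code from any of the projects discussed; the function name and details are invented for the example.

```python
def bigram_tokenize(text):
    """Split a run of CJK characters into overlapping character bi-grams.

    Whitespace is dropped first so that no gram spans a deliberate break.
    This is the simple indexing scheme often used when no word
    segmenter is available: every adjacent character pair becomes a term.
    """
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A single character (or empty input) yields at most one term.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

print(bigram_tokenize("中文分词"))  # ['中文', '文分', '分词']
```

The trade-off, as the thread notes, is recall over precision: bi-grams need no dictionary or segmenter, but they index pairs that cross real word boundaries, which a proper segmentation engine would avoid.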
On Wed, Nov 2, 2011 at 12:16 PM, Miguel Florido
<miguel.florido at softonic.com> wrote:
> I think these emails will be really useful for us.
>
> Currently we're using bi-grams, but we are looking to improve the results of our internal searches. I'll take a look at the mentioned project, http://code.google.com/p/cjk-tokenizer/, and try to use it with Sphinx.
>
> Thanks for your response and your interest in helping us.
>
>
> Miguel Florido
> Web Developer Junior
> miguel.florido at softonic.com
>
>
> http://www.softonic.com
>
> -----Original Message-----
> From: Martin Porter [mailto:martin.f.porter at gmail.com]
> Sent: Wednesday, 2 November 2011 11:52
> To: Miguel Florido
> CC: Snowball-discuss at lists.tartarus.org
> Subject: Re: Chinese and Japanese stemming algorithm
>
> Miguel,
>
> Chinese does not have inflectional endings, so the concept of stemming
> is not relevant to that language. The big problem in Chinese is the
> division of the text to be retrieved, and of queries, into "words", or
> at least into units for indexing and retrieval. (Chinese is of course
> written without spaces.) There is a large literature on this in IR, but
> my own knowledge is now seriously out of date (*footnote). There are
> systems for splitting Chinese text into words. These used to be
> expensive proprietary pieces of software, and I always hoped that free
> versions would eventually appear. Does anyone else contributing to
> snowball-discuss have more knowledge of what is currently available?
>
> Japanese is similar, but the problem is even worse. Since there are no
> spaces, word splitting is required; but thereafter something like
> stemming (or lemmatisation) may be applied. Again, there is a large
> literature.
>
> Actually, no further stemmers are planned at snowball, although we
> might get occasional contributions.
>
> Martin
>
> (*footnote) For example, Information Processing and Management vol. 35
> (1999) has a special section on Asian-language IR, pp. 421 onwards.
>
> On Wed, Nov 2, 2011 at 9:56 AM, Miguel Florido
> <miguel.florido at softonic.com> wrote:
>>
>> Dear Martin
>>
>>
>>
>> We are using one of your projects, the stemming algorithm. Due to our international expansion, we need a stemming algorithm for Chinese and Japanese. As far as we can see, these two algorithms are not implemented yet, and http://snowball.tartarus.org/ shows that the last update was in July 2010. Do you know if they are planned to be implemented in the short to medium term?
>>
>> If not, could you tell us another source where we can get these algorithms?
>> Thanks and best regards.
>> Miguel Florido
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss at lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
>
>