[Snowball-discuss] Chinese and Japanese stemming algorithm
Charlie Hull
charlie at flax.co.uk
Wed Nov 2 10:56:40 GMT 2011
On 02/11/2011 10:51, Martin Porter wrote:
> Miguel,
>
> Chinese does not have inflectional endings, so the concept of stemming
> is not relevant to that language. The big problem in Chinese is the
> division of text to retrieve, and queries, into "words", or at least
> into units for indexing and retrieval. (Chinese is of course written
> without spaces.) There is a big literature on this in IR, but my own
> knowledge is now seriously out of date (*footnote). There are systems
> for splitting Chinese text into words. These used to be expensive
> proprietary pieces of software, and I always hoped that free versions
> would eventually appear. Does anyone else contributing to snowball
> discuss have more knowledge on what is currently available?
>
> Japanese is similar, but the problem is even worse. Since there are no
> spaces, word splitting is required; thereafter something like
> stemming (or lemmatisation) may be applied. Again, there is a big
> literature.
Hi Miguel,
There was a Google Summer of Code project this year on improving Chinese
segmentation for the Xapian engine:
http://trac.xapian.org/wiki/GSoC2011/ChineseSegmentationAnalysis
Maybe this will be of help. There are some CJK segmentation algorithms
for Lucene I believe; otherwise people go to commercial companies such
as Basistech.
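
To make the segmentation problem concrete, here is a minimal Python sketch of
two textbook approaches: greedy forward maximum matching against a word list,
and a dictionary-free fallback that indexes overlapping character bigrams.
This is an illustration only, not the method used by Xapian, Lucene or any
commercial product; the function names, the max_len parameter and the toy
dictionary are all made up for the example, and a real system would need a
large lexicon and some way of resolving ambiguous splits.

```python
def segment_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position take the longest dictionary entry that matches,
    falling back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink towards one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def char_bigrams(text):
    """Dictionary-free alternative: overlapping character bigrams."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Toy dictionary, hypothetical and far too small for real use.
toy_dictionary = {"中文", "分词", "没有", "空格"}

print(segment_max_match("中文分词没有空格", toy_dictionary))
print(char_bigrams("中文分词"))
```

The dictionary approach gives cleaner index terms but depends entirely on the
quality of the word list; the bigram approach needs no lexicon at the cost of
a larger index and noisier matches.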
Cheers
Charlie
>
> Actually, no further stemmers are planned at snowball, although we
> might get occasional contributions.
>
> Martin
>
> (*footnote) for example, Information processing and management vol 35
> (1999) has a special section on Asian language IR pp 421->
>
> On Wed, Nov 2, 2011 at 9:56 AM, Miguel Florido
> <miguel.florido at softonic.com> wrote:
>>
>> Dear Martin
>>
>>
>>
>> We are using one of your projects, the stemming algorithm. Due to our international expansion, we need a stemming algorithm for Chinese and Japanese. As far as we can see, these two algorithms are not implemented yet, and http://snowball.tartarus.org/ shows that the last update was in Jul 2010. Do you know whether they are planned to be implemented in the short or medium term?
>>
>> If not, could you tell us of another source where we can get these algorithms?
>> Thanks and best regards.
>> Miguel Florido
>> Web Developer Junior
>> miguel.florido at softonic.com
>>
>> http://www.softonic.com
>
> _______________________________________________
> Snowball-discuss mailing list
> Snowball-discuss at lists.tartarus.org
> http://lists.tartarus.org/mailman/listinfo/snowball-discuss
--
Charlie Hull
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk