[Xapian-discuss] Czech stemming

Wed Jul 14 04:51:42 BST 2010

On Tue, Jul 13, 2010 at 05:47:01PM +0200, Ladislav Durchánek wrote:
> I just found the Xapian project while looking for an indexing engine in Ruby
> and was quite impressed. Is there any chance of Czech stemming? I found that
> it is already written in Java as part of Lucene here:
> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/cz/CzechStemmer.java?view=markup

Looking at the referenced paper, there are two algorithms described, with
3-clause BSD licensed Java implementations at:

http://members.unine.ch/jacques.savoy/clef/index.html

The Lucene implementation looks to be the "light" one, but the more aggressive
algorithm seems to do better in the evaluations in the paper.

These are probably a better starting point, as we can translate the Java
implementations without any worries about licence compatibility.

It looks like it wouldn't be hard to add.  There are two approaches:

* Implement it in Snowball (http://snowball.tartarus.org/).

* Implement it in C/C++.

The advantage of putting it in Snowball is that other projects can benefit
more easily, and it could be made to work with Xapian 1.0.x easily.

A C/C++ implementation requires the new "user stemmer" feature added in
Xapian 1.2.1, but is probably less work (since we already have working Java
implementations), and the result may be a little faster.

> Sadly, I have no experience with C++, but I am happy to help with testing
> and/or to sponsor porting it to Xapian.

We really need a suitably licensed, decent-sized word list with corresponding
stems - for the current stemmers we have 20-100 thousand words for each
language.  Ideally the corresponding stems should not just be produced by
running the stemmer on the input list (since that will only catch future
incompatible changes to the implementation, not bugs in the initial version),
but it's not practical to derive them all by hand, so checking a subset by hand
and comparing with the other implementations is probably a good approach.
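
As a rough illustration of that kind of check, here is a minimal Python sketch
which compares a stemmer's output against a list of expected stems and reports
the disagreements.  The names find_mismatches and toy_stem are hypothetical,
and toy_stem is only a stand-in so the example runs; it is not the Czech
algorithm from the paper, and the sample words and stems are made up for the
example.  The same function could compare two implementations by passing one
implementation's output as the expected stems.

```python
from typing import Callable, Iterable, List, Tuple


def find_mismatches(
    stem: Callable[[str], str],
    words: Iterable[str],
    expected_stems: Iterable[str],
) -> List[Tuple[str, str, str]]:
    """Return (word, expected, actual) for each word the stemmer gets wrong.

    The two inputs are parallel: the n-th expected stem belongs to the n-th
    word.  If the lists differ in length, the extra entries are ignored.
    """
    mismatches = []
    for word, expected in zip(words, expected_stems):
        actual = stem(word)
        if actual != expected:
            mismatches.append((word, expected, actual))
    return mismatches


def toy_stem(word: str) -> str:
    """Hypothetical stand-in: strip a trailing 'ami' or 'y' from longer words.

    This is NOT a real Czech stemmer; it exists only to exercise the checker.
    """
    for suffix in ("ami", "y"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    # Illustrative data only; a real run would read two parallel files.
    words = ["ženy", "ženami", "pes"]
    expected = ["žen", "žen", "ps"]
    for word, want, got in find_mismatches(toy_stem, words, expected):
        print(f"{word}: expected {want!r}, got {got!r}")
```

A hand-checked subset would simply be a shorter pair of lists fed to the same
function.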

Word lists with a suitable licence can be surprisingly hard to find.  If you
can find an out-of-copyright Czech text (ideally not so old that the language
has changed significantly since), we could easily generate a word list from
that.
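
Generating the word list from a text could look something like the following
Python sketch.  The function name build_word_list and the min_length parameter
are made up for illustration, and the rule used here (lower-cased runs of
letters, after NFC normalisation) is an assumption about what counts as a
word rather than an established convention.

```python
import re
import unicodedata
from typing import List

# A "word" here is a run of Unicode letters: word characters other than
# digits and the underscore.
WORD_RE = re.compile(r"[^\W\d_]+")


def build_word_list(text: str, min_length: int = 1) -> List[str]:
    """Return the sorted, de-duplicated, lower-cased words found in text."""
    # Compose accents so that 'z' + combining caron is treated as 'ž'.
    text = unicodedata.normalize("NFC", text)
    words = {match.group(0).lower() for match in WORD_RE.finditer(text)}
    # Sorting is by code point, not by Czech collation rules.
    return sorted(word for word in words if len(word) >= min_length)


if __name__ == "__main__":
    sample = "Ženy a ženy šly. Šly 3 psi!"
    for word in build_word_list(sample):
        print(word)
```

The output is one word per line, which can then be paired with a file of
corresponding stems.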

If you're able to sponsor work on this, that would certainly be appreciated
and would help it happen sooner.  Feel free to email me off-list to discuss.

Cheers,
    Olly