[Snowball-discuss] Extending the Java compiler

Sebastiano Vigna vigna at dsi.unimi.it
Tue Nov 20 10:52:06 GMT 2007


Dear Martin & Richard,
we are writing to you after doing some work on the Java version of
libstemmer. We are interested mainly in two areas:

0) efficiency;
1) easy integration with our search engine, MG4J.

For the first issue, we have reworked the Java code, fixing a number
of problems (instance instead of class variables, etc.), and presently
we stem about three times faster than before. We have also
substituted character arrays for strings wherever possible,
and done other routine Java optimisations.

We propose that you integrate our changes into your distribution, so
that you can distribute a much faster version and we can avoid
releasing a forked compiler for MG4J.

MG4J contains a high-performance implementation of strings called
MutableString. We want to avoid StringBuffer/String wherever
possible, and use MutableString instead. However, to avoid both a
dependency on MutableString and the inherent slowness of StringBuffer
(which is synchronised), we propose to compile by default for Java 1.5
using StringBuilder (the unsynchronised Java replacement for
StringBuffer). Users will nonetheless be able to supply their own
mutable string buffer class, provided it offers the StringBuilder
methods used by Snowball. Thus, people will be able to supply
java.lang.StringBuffer, to compile for Java <1.5, or
it.unimi.dsi.mg4j.util.MutableString, to compile for MG4J. We would
then integrate into MG4J a customised SnowballProgram, which would
interface with the stemmers generated by the compiler.
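
To make the idea concrete, here is a rough sketch of how the buffer
class name could be substituted into the generated source. This is
only an illustration under our own assumptions: BufferTemplate, the
template text and the sliceDel method are hypothetical names, not the
actual output of the Snowball compiler.

```java
// Hypothetical sketch: emits a tiny SnowballProgram-like source file in
// which the mutable string class is a parameter rather than hard-coded.
class BufferTemplate {
    // %1$s is replaced by the fully qualified name of the buffer class.
    // The only requirement on that class is that it offers the methods
    // the generated code calls (here just delete(int, int)).
    static final String TEMPLATE =
        "public class SnowballProgram {\n"
      + "    protected %1$s current = new %1$s();\n"
      + "\n"
      + "    protected void sliceDel(int bra, int ket) {\n"
      + "        current.delete(bra, ket);\n"
      + "    }\n"
      + "}\n";

    // Returns the source text specialised for the given buffer class.
    static String generate(String bufferClass) {
        return String.format(TEMPLATE, bufferClass);
    }
}
```

Calling BufferTemplate.generate("java.lang.StringBuilder") would give
the proposed default, while passing "java.lang.StringBuffer" would
give source that compiles on Java <1.5, since both classes provide
delete(int, int).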

Another area of improvement, which we have not touched, is the
invocation of the special methods used in Among, which relies on
reflection. Unfortunately, reflective method calls are about 20 times
slower than standard method calls. The right way of passing a method
is to use a strategy object; however, since the only stemmer using
that feature is the Finnish stemmer, we are not presently attacking
the problem.
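
For reference, the difference between the two approaches can be
sketched as follows. All names here (Condition, AmongEntry,
ToyStemmer, r_check) are hypothetical stand-ins chosen for the
example; they are not the real Among class or any generated stemmer.

```java
import java.lang.reflect.Method;

// Hypothetical strategy interface: the routine attached to an among entry.
interface Condition {
    boolean test();
}

// Hypothetical, simplified stand-in for an entry of an among table.
class AmongEntry {
    final char[] s;
    final int result;
    final Condition condition; // null when the entry has no routine

    AmongEntry(String s, int result, Condition condition) {
        this.s = s.toCharArray();
        this.result = result;
        this.condition = condition;
    }

    // Plain interface call, no reflection involved.
    boolean accepts() {
        return condition == null || condition.test();
    }
}

class ToyStemmer {
    int cursor = 0;

    // Stand-in for a routine generated by the compiler.
    public boolean r_check() {
        return cursor > 2;
    }

    // Reflective style: the routine is looked up by name and invoked.
    boolean viaReflection() {
        try {
            Method m = ToyStemmer.class.getMethod("r_check");
            return (Boolean) m.invoke(this);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Strategy style: the routine is wrapped once in an object.
    final AmongEntry entry = new AmongEntry("ssa", 1, new Condition() {
        public boolean test() {
            return r_check();
        }
    });
}
```

Both paths end up calling r_check() and return the same answer; the
strategy version simply replaces the name lookup and Method.invoke
with an ordinary interface call.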

Please let us know if you're interested in integrating the changes.
Everything will be backward compatible, but also three times faster
and more open to integration.

Ciao,

					seba & oerd



