[Snowball-discuss] Extending the Java compiler
Sebastiano Vigna
vigna at dsi.unimi.it
Tue Nov 20 10:52:06 GMT 2007
Dear Martin & Richard,
we are writing to you after doing some work on the Java version of
libstemmer. We are interested mainly in two areas:
0) efficiency;
1) easy integration with our search engine, MG4J.
For the first issue, we have reworked the Java code, fixing a number
of problems (instance variables instead of class variables, etc.), and
at present we stem about three times faster than before. We have also
substituted character arrays for strings wherever possible, and done
other routine Java optimisations.
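
To give an idea of the kind of change meant here, the following is a minimal sketch, not the actual libstemmer code. The class name, the method name and the suffix list are all hypothetical. It assumes that a lookup table can be a class (static) variable shared by every stemmer instance, and that suffixes can be compared directly on a character array without creating intermediate String objects.

```java
// Hypothetical sketch: names and the suffix list are illustrative only,
// they are not taken from the Snowball sources.
class SuffixTableSketch {
    // Class (static) variable: built once and shared by all instances,
    // instead of being rebuilt for every stemmer object.
    private static final char[][] SUFFIXES = {
        "ing".toCharArray(), "ed".toCharArray(), "s".toCharArray()
    };

    // Returns the length of the first listed suffix that ends
    // word[0..length), or 0 if none matches. No String is allocated.
    static int matchSuffix(char[] word, int length) {
        for (int k = 0; k < SUFFIXES.length; k++) {
            char[] s = SUFFIXES[k];
            if (s.length > length) continue;
            int offset = length - s.length;
            int i = 0;
            while (i < s.length && word[offset + i] == s[i]) i++;
            if (i == s.length) return s.length;
        }
        return 0;
    }
}
```
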
We propose that you integrate our changes into your distribution, so
that you can distribute a much faster version and we can avoid the
problem of releasing a forked compiler for MG4J.
MG4J contains a high-performance implementation of strings called
MutableString. We want to avoid StringBuffer/String whenever
possible, and use MutableString instead. To avoid dependency on
MutableString, however, and the inherent slowness of StringBuffer
(which is synchronised) at the same time, we propose to compile by
default for Java 1.5 using StringBuilder instead (the Java
replacement for StringBuffer). Users will be able, however, to
supply their own mutable string buffer class, provided it offers the
StringBuilder methods used by Snowball. Thus, people will be able to
supply java.lang.StringBuffer, to compile for Java <1.5, or
it.unimi.dsi.mg4j.util.MutableString, to compile for MG4J. We would
then integrate in MG4J a customised SnowballProgram, which would
interface with the stemmers generated by the compiler.
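
As a rough illustration of what "the StringBuilder methods used by Snowball" could mean in practice, here is a minimal sketch. The class BufferBackedProgram is hypothetical and is not the real SnowballProgram; the set of buffer methods it calls (append, delete, replace, charAt, length) is our assumption. The point is only that the buffer type name appears in one field declaration, so code generated with a different class name would compile unchanged as long as that class offers the same methods.

```java
// Hypothetical sketch, not the actual SnowballProgram.
// Assumption: replacing "StringBuilder" below with "StringBuffer" (or any
// class offering the same append/delete/replace/charAt/length methods)
// leaves the rest of the class unchanged.
class BufferBackedProgram {
    private final StringBuilder current = new StringBuilder();
    private int cursor;
    private int limit;

    void setCurrent(String word) {
        current.delete(0, current.length());
        current.append(word);
        cursor = 0;
        limit = current.length();
    }

    String getCurrent() {
        return current.toString();
    }

    // Replaces [from, to) with s and returns the change in length.
    int replace(int from, int to, String s) {
        int adjustment = s.length() - (to - from);
        current.replace(from, to, s);
        limit += adjustment;
        if (cursor >= to) cursor += adjustment;
        else if (cursor > from) cursor = from;
        return adjustment;
    }

    // Tests whether the text before the limit ends with the given suffix.
    boolean endsWith(String suffix) {
        int n = suffix.length();
        if (n > limit) return false;
        for (int i = 0; i < n; i++) {
            if (current.charAt(limit - n + i) != suffix.charAt(i)) return false;
        }
        return true;
    }
}
```
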
Another area of improvement, which we have not touched, is the
invocation of special methods used in Among, which relies on
reflection. Unfortunately, reflective method calls are about 20 times
slower than standard method calls. The right way to pass a method is
to use a strategy object; since the only stemmer using that feature is
the Finnish stemmer, we are not attacking the problem at present.
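
To make the strategy-object idea concrete, here is a minimal sketch contrasting the two ways of passing a method. All names (AmongCondition, StrategyEntry, SketchStemmer, longEnough) and the suffix are hypothetical and are not taken from the Among class or the Finnish stemmer; the sketch makes no claim about the actual speed difference.

```java
import java.lang.reflect.Method;

// Hypothetical strategy interface: one object per special condition.
interface AmongCondition {
    boolean test();
}

// Hypothetical table entry holding a suffix and an optional condition.
class StrategyEntry {
    final String suffix;
    final AmongCondition condition; // null means "no extra check"

    StrategyEntry(String suffix, AmongCondition condition) {
        this.suffix = suffix;
        this.condition = condition;
    }

    boolean accepts() {
        return condition == null || condition.test();
    }
}

class SketchStemmer {
    int syllables = 2; // illustrative state only

    public boolean longEnough() {
        return syllables > 1;
    }

    // Reflective style: the method is looked up by name and invoked
    // through java.lang.reflect.
    boolean viaReflection() {
        try {
            Method m = SketchStemmer.class.getMethod("longEnough");
            return ((Boolean) m.invoke(this)).booleanValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Strategy style: the condition is an ordinary object whose test()
    // method is called directly, with no reflection involved.
    boolean viaStrategy() {
        StrategyEntry entry = new StrategyEntry("abc", new AmongCondition() {
            public boolean test() {
                return longEnough();
            }
        });
        return entry.accepts();
    }
}
```
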
Please let us know if you're interested in integrating the changes.
Everything will be backward compatible, but also three times faster
and more open to integration.
Ciao,
seba & oerd