[Snowball-discuss] Extending the Java compiler
Richard Boulton
richard at lemurconsulting.com
Mon Mar 3 20:12:58 GMT 2008
Sebastiano Vigna wrote:
> We propose you to integrate our changes in your distribution so to both
> distributing a much faster version, and avoiding us the problem of
> releasing a forked compiler for MG4J.
I'm pleased to say I've just committed various fixes supplied by you to
the snowball HEAD. However, the patch I received didn't quite work (it
referenced a non-existent AbstractSnowballTermProcessor class, for a
start), so what I've committed is slightly different to the contents of
the patch. For reference, the changes I committed are at
http://svn.tartarus.org/trunk/snowball/?rev=502&root=Snowball&view=rev
Rather than add a SnowballStemmer class which is a near duplicate of
SnowballProgram (i.e., identical except for the added stem() method), I
added a SnowballStemmer class which inherits from SnowballProgram. I
think this makes sense, but let me know if it is a problem. The
generated stemming algorithms inherit from SnowballStemmer, since they
all have a stem() method.
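
As a rough illustration of the arrangement described above, here is a minimal sketch. The class bodies are simplified stand-ins, not the committed code, and ToyStemmer (with its trailing-"s" rule) is purely hypothetical; it stands in for a generated stemming algorithm.

```java
// Simplified stand-in for the real SnowballProgram: it only holds the
// word currently being processed.
abstract class SnowballProgram {
    protected final StringBuilder current = new StringBuilder();

    public void setCurrent(String value) {
        current.setLength(0);
        current.append(value);
    }

    public String getCurrent() {
        return current.toString();
    }
}

// SnowballStemmer adds only the stem() method on top of SnowballProgram.
abstract class SnowballStemmer extends SnowballProgram {
    public abstract boolean stem();
}

// Hypothetical generated stemmer: strips a single trailing 's'.
class ToyStemmer extends SnowballStemmer {
    @Override
    public boolean stem() {
        int n = current.length();
        if (n > 1 && current.charAt(n - 1) == 's') {
            current.setLength(n - 1);
        }
        return true;
    }
}
```

Calling code can then hold any generated stemmer as a SnowballStemmer and call setCurrent(), stem() and getCurrent() without knowing the concrete class.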
> MG4J contains a high-performance implementation of strings called
> MutableString. We want to avoid StringBuffer/String whenever possible,
> and use MutableString instead. To avoid dependency on MutableString,
> however, and the inherent slowness of StringBuffer (which is
> synchronised) at the same time, we propose to compile by default for
> Java 1.5 using StringBuilder instead (the Java replacement for
> StringBuffer).
This change is now applied. However, in order to allow either
StringBuffer or StringBuilder to be used, I had to add overloaded
versions of SnowballProgram.slice_to() and assign_to(). (The other
SnowballProgram methods which took a StringBuffer could be modified to
take a CharSequence.) I think this will prevent compilation with Java
<1.5, which is a shame (though I'm not sure anyone is still using
Snowball with <1.5 anyway).
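
To show why overloads are needed for the mutating methods while a CharSequence parameter is enough for the read-only ones, here is a minimal sketch. SliceSketch and sliceEquals() are hypothetical names, and the slice_to() bodies are an assumption about the general shape, not the committed implementation. StringBuffer and StringBuilder share no public interface that offers replace(), so each needs its own overload, and the mere reference to StringBuilder is what requires Java 1.5.

```java
// Hypothetical stand-in for the relevant part of SnowballProgram.
class SliceSketch {
    private final StringBuilder current;
    private final int bra;  // start of the current slice
    private final int ket;  // end of the current slice (exclusive)

    SliceSketch(CharSequence text, int bra, int ket) {
        this.current = new StringBuilder(text);
        this.bra = bra;
        this.ket = ket;
    }

    // Mutating method: one overload per concrete buffer class.
    StringBuilder slice_to(StringBuilder s) {
        s.replace(0, s.length(), current.substring(bra, ket));
        return s;
    }

    StringBuffer slice_to(StringBuffer s) {
        s.replace(0, s.length(), current.substring(bra, ket));
        return s;
    }

    // Read-only method: CharSequence accepts String, StringBuffer and
    // StringBuilder alike, so no overload is needed.
    boolean sliceEquals(CharSequence s) {
        if (s.length() != ket - bra) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (current.charAt(bra + i) != s.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```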
> The user will be able, however, to supply its own mutable
> string buffer class, provided it sports the StringBuilder methods used
> by Snowball. Thus, people will be able to supply java.lang.StringBuffer,
> to compile for Java <1.5, or it.unimi.dsi.mg4j.util.MutableString, to
> compile for MG4J. We would then integrate in MG4J a customised
> SnowballProgram, which would interface with the stemmers generated by
> the compiler.
I don't think this will quite work with the patch I just committed,
because the SnowballProgram.slice_to() method will not accept a
it.unimi.dsi.mg4j.util.MutableString (unless that happens to be a
subclass of a StringBuffer or StringBuilder). Currently, the only
solution for this I can think of is to produce a suitable
SnowballProgram subclass as part of the output from snowball - a nicer
solution would be desirable, if you can think of one.
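
For concreteness, the subclass workaround could look roughly like the sketch below. Everything in it is an assumption for illustration: CustomBuffer is a hypothetical stand-in for a user-supplied class such as MutableString (whose real API is not shown here), and BaseProgram stands in for SnowballProgram.

```java
// Hypothetical user-supplied buffer; it is unrelated to StringBuffer and
// StringBuilder, so the existing slice_to() overloads cannot accept it.
final class CustomBuffer {
    private final StringBuilder data = new StringBuilder();

    void replaceAll(CharSequence s) {
        data.setLength(0);
        data.append(s);
    }

    @Override
    public String toString() {
        return data.toString();
    }
}

// Stand-in for SnowballProgram with only the StringBuilder overload.
class BaseProgram {
    protected final StringBuilder current = new StringBuilder();
    protected int bra;
    protected int ket;

    public void set(String text, int bra, int ket) {
        current.setLength(0);
        current.append(text);
        this.bra = bra;
        this.ket = ket;
    }

    public StringBuilder slice_to(StringBuilder s) {
        s.replace(0, s.length(), current.substring(bra, ket));
        return s;
    }
}

// The subclass that would have to be emitted alongside the stemmers:
// it adds a slice_to() overload for the custom buffer type.
class CustomProgram extends BaseProgram {
    public CustomBuffer slice_to(CustomBuffer s) {
        s.replaceAll(current.subSequence(bra, ket));
        return s;
    }
}
```

The drawback is visible in the sketch: every additional buffer type needs its own subclass and its own overloads.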
> Another area of improvement, which however we didn't touch, is the
> invocation of special methods used in Among, which uses reflection.
> Unfortunately, reflective method calls are about 20 times slower than
> standard method calls. The right way of passing a method is having a
> strategy object: since the only stemmer using that feature is the
> Finnish stemmer, we are not presently attacking the problem.
Let us know if you do address it - this problem has been reported
frequently, and we'd be happy to apply a patch which fixed it.
Fortunately, even in the Finnish case, I believe it is rare for the
method to be called.
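
To make the contrast concrete, here is a minimal sketch of the two dispatch styles. The names AmongDispatchSketch, Condition and r_check are hypothetical and this is not how the generated code is actually structured; it only shows a reflective call next to an equivalent strategy-object call.

```java
import java.lang.reflect.Method;

class AmongDispatchSketch {
    // Hypothetical strategy interface: one object per special method.
    interface Condition {
        boolean test();
    }

    // Stand-in for a routine attached to an Among entry.
    boolean r_check() {
        return true;
    }

    // Reflective dispatch: the method is looked up by name and invoked
    // through java.lang.reflect.Method.
    boolean viaReflection() {
        try {
            Method m = AmongDispatchSketch.class.getDeclaredMethod("r_check");
            return (Boolean) m.invoke(this);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Strategy dispatch: an ordinary interface call, no lookup by name.
    boolean viaStrategy() {
        Condition c = new Condition() {
            public boolean test() {
                return r_check();
            }
        };
        return c.test();
    }
}
```

In a real implementation the Method or Condition object would be created once and stored with the Among entry rather than rebuilt on every call.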
--
Richard