[Snowball-discuss] Extending the Java compiler

Richard Boulton richard at lemurconsulting.com
Mon Mar 3 20:12:58 GMT 2008


Sebastiano Vigna wrote:
> We propose you to integrate our changes in your distribution so to both 
> distributing a much faster version, and avoiding us the problem of 
> releasing a forked compiler for MG4J.

I'm pleased to say I've just committed various fixes supplied by you to 
the snowball HEAD.  However, the patch I received didn't quite work (it 
referenced a non-existent AbstractSnowballTermProcessor class, for a 
start), so what I've committed is slightly different to the contents of 
the patch.  For reference, the changes I committed are at 
http://svn.tartarus.org/trunk/snowball/?rev=502&root=Snowball&view=rev

Rather than add a SnowballStemmer class which is a near duplicate of 
SnowballProgram (i.e., identical except for the added stem() method), I 
added a SnowballStemmer class which inherits from SnowballProgram.  I 
think this makes sense, but let me know if it is a problem.  The 
generated stemming algorithms inherit from SnowballStemmer, since they 
all have a stem() method.
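
To illustrate the class layout described above, here is a minimal 
sketch.  It is a simplified assumption about the shape of the classes, 
not the committed source: the real SnowballProgram carries much more 
state, and ToyStemmer and its single rule are invented for the example.

```java
// Simplified sketch of the hierarchy; not the committed source.
class SnowballProgram {
    // Text being stemmed. The real class also tracks cursor positions.
    protected StringBuilder current = new StringBuilder();

    public void setCurrent(String value) {
        current.replace(0, current.length(), value);
    }

    public String getCurrent() {
        return current.toString();
    }
}

// Adds only the stem() entry point on top of SnowballProgram.
abstract class SnowballStemmer extends SnowballProgram {
    public abstract boolean stem();
}

// Hypothetical stand-in for a generated stemmer: strips one trailing "s".
class ToyStemmer extends SnowballStemmer {
    @Override
    public boolean stem() {
        int len = current.length();
        if (len > 1 && current.charAt(len - 1) == 's') {
            current.setLength(len - 1);
            return true;
        }
        return false;
    }
}
```

Callers can then hold any generated stemmer as a SnowballStemmer and 
call setCurrent(), stem() and getCurrent() without knowing the language.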

> MG4J contains a high-performance implementation of strings called 
> MutableString. We want to avoid StringBuffer/String whenever possible, 
> and use MutableString instead. To avoid dependency on MutableString, 
> however, and the inherent slowness of StringBuffer (which is 
> synchronised) at the same time, we propose to compile by default for 
> Java 1.5 using StringBuilder instead (the Java replacement for 
> StringBuffer).

This change is now applied.  However, in order to allow either 
StringBuffer or StringBuilder to be used, I had to add overloaded 
versions of SnowballProgram.slice_to() and assign_to().  (The other 
SnowballProgram methods which took a StringBuffer could be modified to 
take a CharSequence.)  I think this will prevent compilation with Java 
<1.5, which is a shame (though I'm not sure anyone will be using snowball 
with <1.5 anyway).
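
The overloading described above might look roughly like the following 
sketch.  The class name SliceSketch, its constructor and the 
sliceEquals() helper are hypothetical and exist only for illustration; 
only slice_to() corresponds to a method named in this message.  Methods 
that write into their argument need a mutable type, so each buffer 
class gets its own overload, while read-only methods can accept any 
CharSequence.  The StringBuilder overload is what ties the code to 
Java 1.5, since that class first appeared in that release.

```java
// Hypothetical, cut-down stand-in for the relevant part of SnowballProgram.
class SliceSketch {
    private final StringBuilder current;
    private final int bra; // start of the current slice
    private final int ket; // end of the current slice (exclusive)

    SliceSketch(String text, int bra, int ket) {
        this.current = new StringBuilder(text);
        this.bra = bra;
        this.ket = ket;
    }

    // Overload for callers that still use StringBuffer.
    StringBuffer slice_to(StringBuffer s) {
        s.replace(0, s.length(), current.substring(bra, ket));
        return s;
    }

    // Overload for callers that use StringBuilder (Java 1.5 and later).
    StringBuilder slice_to(StringBuilder s) {
        s.replace(0, s.length(), current.substring(bra, ket));
        return s;
    }

    // Read-only comparison: one CharSequence parameter covers String,
    // StringBuffer and StringBuilder alike.
    boolean sliceEquals(CharSequence s) {
        if (ket - bra != s.length()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (current.charAt(bra + i) != s.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```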

> The user will be able, however, to supply its own mutable
> string buffer class, provided it sports the StringBuilder methods used 
> by Snowball. Thus, people will be able to supply java.lang.StringBuffer, 
> to compile for Java <1.5, or it.unimi.dsi.mg4j.util.MutableString, to 
> compile for MG4J. We would then integrate in MG4J a customised 
> SnowballProgram, which would interface with the stemmers generated by 
> the compiler.

I don't think this will quite work with the patch I just committed, 
because the SnowballProgram.slice_to() method will not accept an 
it.unimi.dsi.mg4j.util.MutableString (and since StringBuffer and 
StringBuilder are both final classes, it cannot be a subclass of 
either).  Currently, the only solution to this I can think of is to 
produce a suitable SnowballProgram subclass as part of the output from 
snowball - a nicer solution would be desirable, if you can think of one.

> Another area of improvement, which however we didn't touch, is the 
> invocation of special methods used in Among, which uses reflection. 
> Unfortunately, reflective method calls are about 20 times slower than 
> standard method calls. The right way of passing a method is having a 
> strategy object: since the only stemmer using that feature is the 
> Finnish stemmer, we are not presently attacking the problem.

Let us know if you do address it - this problem has been reported 
frequently, and we'd be happy to apply a patch which fixed it. 
Fortunately, even in the Finnish case, I believe it is rare for the 
method to be called.
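
For anyone picking this up, the difference between the two dispatch 
styles can be sketched as below.  All names here (AmongAction, 
AmongSketch, r_check, viaReflection, viaStrategy) are hypothetical and 
are not the names used in the generated code; the sketch only contrasts 
a by-name reflective call with a call through a strategy object.

```java
import java.lang.reflect.Method;

// Hypothetical strategy interface: one object per special method.
interface AmongAction {
    boolean call();
}

class AmongSketch {
    int calls = 0;

    // Stand-in for a routine that an among entry wants to invoke.
    public boolean r_check() {
        calls++;
        return true;
    }

    // Reflection-style dispatch: look the method up by name, then
    // invoke it through java.lang.reflect.Method.
    boolean viaReflection(String methodName) {
        try {
            Method m = getClass().getMethod(methodName);
            return (Boolean) m.invoke(this);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Strategy-style dispatch: an ordinary interface call.
    boolean viaStrategy(AmongAction action) {
        return action.call();
    }
}
```

A caller would pass the strategy as a small object, for example an 
anonymous class whose call() simply returns program.r_check(), so no 
lookup by name is needed at stemming time.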

-- 
Richard


