[Snowball-discuss] Extending the Java compiler

Martin Porter martin.porter at grapeshot.co.uk
Mon Dec 3 17:31:11 GMT 2007


Sebastiano, 

Thanks for the email, and your notes on the Java improvements. I have
talked over the phone to Richard Boulton about the best way to proceed.
There are a couple of problems for us. One is that neither of us knows all
that much about Java. (Richard wrote the code generator for Java, but has
little direct Java experience. The Porter stemmer in Java is the only
Java program of significance I've ever written.) The other is that we're
both rather busy at the moment with other work.

I suggest you put your changes into a tar file and send it to us. We can
then offer it for distribution from the site until we get round to
incorporating your work.

On the use of "reflection" in the Among implementation, I don't think
there is anything much to worry about since it is so rarely used.
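
As a rough sketch of the alternative Sebastiano mentions, a strategy
object replaces the by-name method lookup with an ordinary interface
call. Every class and method name below is hypothetical and is not taken
from the Snowball runtime; the sketch only contrasts the two dispatch
styles and makes no claim about their relative speed.

```java
import java.lang.reflect.Method;

// Hypothetical names throughout; none of this is taken from the Snowball sources.
class AmongDispatchSketch {

    // Strategy interface: one object per routine attached to an among entry.
    interface Condition {
        boolean test(Stemmer s);
    }

    // Stand-in for a generated stemmer class.
    public static class Stemmer {
        public String current = "";

        // Stand-in for a generated routine that an among entry would call.
        public boolean longEnough() {
            return current.length() > 3;
        }
    }

    // Reflective dispatch: look the method up by name, then invoke it.
    static boolean callReflectively(Stemmer s, String methodName) {
        try {
            Method m = Stemmer.class.getMethod(methodName);
            return (Boolean) m.invoke(s);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Strategy dispatch: a plain interface call, with no reflection involved.
    static boolean callViaStrategy(Stemmer s, Condition c) {
        return c.test(s);
    }

    public static void main(String[] args) {
        Stemmer s = new Stemmer();
        s.current = "talo";
        Condition longEnough = new Condition() {
            public boolean test(Stemmer st) {
                return st.longEnough();
            }
        };
        System.out.println(callReflectively(s, "longEnough"));
        System.out.println(callViaStrategy(s, longEnough));
    }
}
```

Both calls reach the same routine; the strategy version simply lets the
compiler bind the call at compile time instead of resolving a name at
run time.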

A couple of things you might advise us on: are the speed improvements
you've made equally effective across different styles of Java (compiled
vs interpreted, with or without native threads, recent and early
releases etc)? And are the changes you've made tied to later releases in
any way?

Martin 


On Tue, 2007-11-20 at 11:52 +0100, Sebastiano Vigna wrote:
> Dear Martin & Richard,
> we are writing to you after doing some work on the Java version of
> libstemmer. We are interested mainly in two areas:
> 
> 0) efficiency;
> 1) easy integration with our search engine, MG4J.
> 
> For the first issue, we have reworked the Java code, fixing a number
> of problems (instance instead of class variables, etc.), and at present
> we stem about three times faster than before. We have also
> used character arrays instead of strings wherever possible,
> and done other routine Java optimisations.
>
> We propose that you integrate our changes into your distribution, so
> that you can distribute a much faster version and we can avoid the
> problem of releasing a forked compiler for MG4J.
> 
> MG4J contains a high-performance implementation of strings called  
> MutableString. We want to avoid StringBuffer/String whenever  
> possible, and use MutableString instead. To avoid dependency on  
> MutableString, however, and the inherent slowness of StringBuffer  
> (which is synchronised) at the same time, we propose to compile by  
> default for Java 1.5 using StringBuilder instead (the Java  
> replacement for StringBuffer). The user will be able, however, to  
> supply their own mutable string buffer class, provided it sports the
> StringBuilder methods used by Snowball. Thus, people will be able to  
> supply java.lang.StringBuffer, to compile for Java <1.5, or  
> it.unimi.dsi.mg4j.util.MutableString, to compile for MG4J. We would  
> then integrate in MG4J a customised SnowballProgram, which would  
> interface with the stemmers generated by the compiler.
> 
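
A minimal sketch of why swapping the buffer class is feasible:
StringBuilder and StringBuffer expose the same editing methods, so code
that sticks to that shared set compiles against either one with only the
class name changed. The helper names below are hypothetical and are not
what the Snowball code generator emits.

```java
// Hypothetical sketch; class and method names are illustrative only.
class BufferSwapSketch {

    // A suffix-stripping step written against StringBuilder (Java 1.5 and later).
    static String stripSuffixBuilder(String word, String suffix) {
        StringBuilder current = new StringBuilder(word);
        int start = current.length() - suffix.length();
        if (start >= 0 && current.substring(start).equals(suffix)) {
            current.delete(start, current.length());
        }
        return current.toString();
    }

    // The same body with only the class name changed to StringBuffer,
    // which is synchronised and also available before Java 1.5.
    static String stripSuffixBuffer(String word, String suffix) {
        StringBuffer current = new StringBuffer(word);
        int start = current.length() - suffix.length();
        if (start >= 0 && current.substring(start).equals(suffix)) {
            current.delete(start, current.length());
        }
        return current.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripSuffixBuilder("stemming", "ing"));
        System.out.println(stripSuffixBuffer("stemming", "ing"));
    }
}
```

A user-supplied class such as MutableString would need to offer the same
method names and signatures for this kind of substitution to compile.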
> Another area of improvement, which however we didn't touch, is the  
> invocation of special methods used in Among, which uses reflection.  
> Unfortunately, reflective method calls are about 20 times slower than  
> standard method calls. The right way of passing a method is having a  
> strategy object: since the only stemmer using that feature is the  
> Finnish stemmer, we are not presently attacking the problem.
> 
> Please let us know if you're interested in integrating the changes.
> Everything will be backward compatible, but also three times faster  
> and more open to integration.
> 
> Ciao,
> 
> 					seba & oerd