[Snowball-discuss] Re: Possible memory leak in Snowballs Java stemmer (Richard Boulton)
Thu Jun 3 23:57:02 2004
I missed the original message about Java and memory problems, but I've done a
fair amount of thinking about how the Java code could be re-architected. Here
are the difficulties with the current system:
1. Multithreading. In a multithreaded app, like a web app, you have to create a
new instance of a stemmer for each thread. This generates garbage for each new
thread. The reason is that SnowballProgram.java keeps its working state in
member variables, so two simultaneous calls to stem() on the same instance will
interfere with each other. Declaring stem() to be synchronized solves the
threading problem, but it serializes every call and kills performance.
2. Reflection. Among.java relies upon reflection to select a stemmer.
Reflection is slow and causes big problems for obfuscators.
3. The relationship between Among, SnowballProgram, and the individual stemmers
A better approach would be to eliminate Among entirely. Don't use
Class.forName() at all. Just put the code which is common to all stemmers in
SnowballProgram, and have each stemmer inherit from it.
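
To make that concrete, here is a rough sketch of the shape I have in mind. The names BaseStemmer, ToyEnglishStemmer and Stemmers are hypothetical, and the suffix rule is a toy stand-in, not the real Snowball-generated algorithm. The point is only that shared helpers live in the base class and the stemmer is picked with ordinary code rather than reflection.

```java
// Hypothetical sketch: class names and the toy suffix rule are illustrative
// assumptions, not the real Snowball-generated code.
abstract class BaseStemmer {
    // Shared helper inherited by every stemmer:
    // true if buf[0..len) ends with the given suffix.
    protected static boolean endsWith(char[] buf, int len, String suffix) {
        int n = suffix.length();
        if (n > len) return false;
        for (int i = 0; i < n; i++) {
            if (buf[len - n + i] != suffix.charAt(i)) return false;
        }
        return true;
    }

    // Stems buf[0..len) in place and returns the new length.
    abstract int stem(char[] buf, int len);
}

final class ToyEnglishStemmer extends BaseStemmer {
    @Override
    int stem(char[] buf, int len) {
        // Toy rule only: strip a trailing "ing" or "s" if enough of the word remains.
        if (len > 5 && endsWith(buf, len, "ing")) return len - 3;
        if (len > 3 && endsWith(buf, len, "s")) return len - 1;
        return len;
    }
}

final class Stemmers {
    // Plain comparison instead of Class.forName(): no reflection involved,
    // so an obfuscator can rename the classes freely.
    static BaseStemmer forLanguage(String language) {
        if ("english".equals(language)) return new ToyEnglishStemmer();
        throw new IllegalArgumentException("No stemmer for: " + language);
    }
}
```

With this layout, adding a language means adding one subclass and one line in forLanguage(), and nothing is looked up by name at run time.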
If you modify the stem() method to refrain from accessing any variables defined
outside the method itself then the multithreading problem will go away.
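
A minimal illustration of why this works, assuming a hypothetical StatelessStemmer with a toy rule (drop one trailing 's'), not the real algorithm: because the class has no fields, one instance can be called from many threads at once without synchronization, as long as each thread brings its own buffers.

```java
// Hypothetical sketch: StatelessStemmer and its toy rule are assumptions
// made for illustration only.
final class StatelessStemmer {
    // All working state lives in parameters and locals; the class has no fields.
    int stem(char[] in, int inOffset, int inLength, char[] out, int outOffset) {
        int outLength = inLength;
        // Toy rule: drop one trailing 's' from words longer than three chars.
        if (inLength > 3 && in[inOffset + inLength - 1] == 's') {
            outLength = inLength - 1;
        }
        System.arraycopy(in, inOffset, out, outOffset, outLength);
        return outLength;
    }
}

final class SharedStemmerDemo {
    // Calls one shared stemmer from several threads at once.
    // Returns true if every call produced the expected stem.
    static boolean run(int threadCount, int iterations) {
        final StatelessStemmer shared = new StatelessStemmer();
        final boolean[] ok = new boolean[threadCount];
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                // Buffers are per thread; only the stemmer itself is shared.
                char[] in = "stemmers".toCharArray();
                char[] out = new char[64];
                boolean good = true;
                for (int i = 0; i < iterations; i++) {
                    int n = shared.stem(in, 0, in.length, out, 0);
                    good &= new String(out, 0, n).equals("stemmer");
                }
                ok[id] = good;
            });
            threads[t].start();
        }
        boolean all = true;
        for (int t = 0; t < threadCount; t++) {
            try {
                threads[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            all &= ok[t];
        }
        return all;
    }
}
```

The String comparison inside the loop is only there to check the result; the stem() call itself allocates nothing.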
Another way to make things *much* more efficient is to eliminate all use of
Strings and StringBuffers. Strings always generate garbage and StringBuffers
have a lot of synchronized methods. Instead, pass char arrays to stem() which
contain the input and receive the output.
Here's some sample code:
// EnglishStemmer inherits from SnowballProgram, and can be shared by multiple threads
SnowballProgram stemmer = new EnglishStemmer();
String input = "hello";
int inLength = input.length();
int inOffset = 0;
char[] in = new char[64];   // 64 = assumed maximum word length
input.getChars(0, inLength, in, 0);
char[] out = new char[64];
int outOffset = 0;
// Proposed signature: stem() returns the length of the stem written to out
int outLength = stemmer.stem(in, inOffset, inLength, out, outOffset);
The in and out buffers are reusable, making it possible to stem many words
without generating any garbage at all. Of course, this scheme is only possible
if all stems are always shorter than some known value, like 64 chars.
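
For what it's worth, a loop over many words might look roughly like the sketch below. ReusableBufferExample and toyStem() are hypothetical stand-ins (the toy rule just drops one trailing 's'), and skipping over-long words is only one possible policy; a real application might fall back to a larger buffer instead.

```java
// Hypothetical sketch: the class, the toy stemming rule and the skip policy
// for over-long words are all assumptions made for illustration.
final class ReusableBufferExample {
    static final int MAX_WORD = 64; // assumed upper bound on word and stem length

    // Stand-in for a real stemmer: copies the word to out,
    // dropping one trailing 's' from words longer than three chars.
    static int toyStem(char[] in, int inOffset, int inLength, char[] out, int outOffset) {
        int outLength = inLength;
        if (inLength > 3 && in[inOffset + inLength - 1] == 's') {
            outLength = inLength - 1;
        }
        System.arraycopy(in, inOffset, out, outOffset, outLength);
        return outLength;
    }

    // Stems every word with a single pair of buffers and returns the total
    // number of characters removed. Nothing is allocated inside the loop.
    static int countRemovedChars(String[] words) {
        char[] in = new char[MAX_WORD];
        char[] out = new char[MAX_WORD];
        int removed = 0;
        for (String word : words) {
            int len = word.length();
            if (len > MAX_WORD) {
                continue; // does not fit the fixed buffers: skipped in this sketch
            }
            word.getChars(0, len, in, 0);
            int outLen = toyStem(in, 0, len, out, 0);
            removed += len - outLen;
        }
        return removed;
    }
}
```

The length check is the price of the fixed-size buffers: without it, an unusually long token would throw an ArrayIndexOutOfBoundsException from getChars().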