[Snowball-discuss] Re: Possible memory leak in Snowballs Java stemmer (Richard Boulton)

Chris Cleveland ccleveland@dieselpoint.com
Thu Jun 3 23:57:02 2004


I missed the original message about Java and memory problems, but I've done a
fair amount of thinking about how the Java code could be re-architected. Here
are the difficulties with the current system:

1. Multithreading. In a multithreaded app, like a web app, you have to create a
new instance of a stemmer for each thread. This generates garbage for each new
thread. The reason is that SnowballProgram.java keeps its working state in
instance variables, so two simultaneous calls to stem() on the same instance
will cause problems. Declaring stem() to be synchronized solves the threading
problem, but it kills performance.

2. Reflection. Among.java relies upon reflection to select a stemmer.
Reflection is slow and causes big problems for obfuscators.

3. The relationship between Among, SnowballProgram, and the individual stemmers
is complicated.

A better approach would be to eliminate Among entirely. Don't use
Class.forName() at all. Just put the code which is common to all stemmers in
SnowballProgram, and have each stemmer inherit from it.
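
As a rough illustration of that layout, here is a minimal sketch. The names
SketchProgram, SketchPluralStemmer and endsWith() are hypothetical and are not
the actual Snowball classes; the single plural rule is only a placeholder for
real stemming rules. The point is that the shared helper lives in the base
class and the subclass calls it directly, with no Class.forName() or
Method.invoke() involved.

```java
// Hypothetical sketch: shared matching helpers live in the base class,
// and each stemmer calls them directly (no reflection).
abstract class SketchProgram {
    // Returns true if buf[0..len) ends with the given suffix.
    protected static boolean endsWith(char[] buf, int len, String suffix) {
        int n = suffix.length();
        if (n > len) return false;
        for (int i = 0; i < n; i++) {
            if (buf[len - n + i] != suffix.charAt(i)) return false;
        }
        return true;
    }

    // Stems buf[0..len) and returns the new length.
    abstract int stem(char[] buf, int len);
}

// A toy stemmer that only strips a plural "s"; real stemmers have many rules.
class SketchPluralStemmer extends SketchProgram {
    @Override
    int stem(char[] buf, int len) {
        // Direct, statically bound call instead of a reflective lookup.
        if (len > 3 && endsWith(buf, len, "s") && !endsWith(buf, len, "ss")) {
            return len - 1;
        }
        return len;
    }
}

class InheritanceDemo {
    // Convenience wrapper for the demo only.
    static String stemWord(String word) {
        char[] buf = word.toCharArray();
        int len = new SketchPluralStemmer().stem(buf, buf.length);
        return new String(buf, 0, len);
    }

    public static void main(String[] args) {
        System.out.println(stemWord("stemmers"));
        System.out.println(stemWord("class"));
    }
}
```

Because every call is an ordinary method call, an obfuscator can rename the
methods freely, which is not the case when they are looked up by name at
runtime.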

If you modify the stem() method to refrain from accessing any variables defined
outside the method itself then the multithreading problem will go away.
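
To make that concrete, here is a small sketch under stated assumptions:
StatelessSketchStemmer is a hypothetical toy class (it only strips a trailing
"ing"), not Snowball code. Its stem() touches nothing but its parameters and
local variables, so a single instance is shared by several threads without
any synchronization. Each thread keeps its own buffer.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy stemmer: stem() reads only its parameters and local
// variables, so one instance can be called from many threads at once.
class StatelessSketchStemmer {
    // Strips a trailing "ing" from buf[0..len) and returns the new length.
    int stem(char[] buf, int len) {
        boolean endsWithIng = len > 5
                && buf[len - 3] == 'i'
                && buf[len - 2] == 'n'
                && buf[len - 1] == 'g';
        return endsWithIng ? len - 3 : len;
    }
}

class SharedStemmerDemo {
    // Calls one shared stemmer from several threads and returns the number
    // of results that differ from the expected length.
    static int run(int threadCount, int iterations) {
        StatelessSketchStemmer shared = new StatelessSketchStemmer();
        AtomicInteger errors = new AtomicInteger();
        Thread[] threads = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            threads[t] = new Thread(() -> {
                // Each thread owns its buffer; only the stemmer is shared.
                char[] buf = new char[64];
                for (int i = 0; i < iterations; i++) {
                    "stemming".getChars(0, 8, buf, 0);
                    int len = shared.stem(buf, 8);
                    if (len != 5) errors.incrementAndGet();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return errors.get();
    }

    public static void main(String[] args) {
        System.out.println("wrong results: " + run(4, 100000));
    }
}
```

The stemmer has no fields at all, which is the simplest way to guarantee that
stem() cannot depend on state outside the method.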

Another way to make things *much* more efficient is to eliminate all use of
Strings and StringBuffers. Strings always generate garbage and StringBuffers
have a lot of synchronized methods. Instead, pass char[] arrays to stem() which
contain the input and receive the output.

Here's some sample code:

// EnglishStemmer inherits from SnowballProgram, and can be shared by multiple
// threads
SnowballProgram stemmer = new EnglishStemmer();

String input = "hello";
int inLength = input.length();
int inOffset = 0;
char[] in = new char[64];
input.getChars(0, inLength, in, 0);

char[] out = new char[64];
int outOffset = 0;
int outLength = stemmer.stem(in, inOffset, inLength, out, outOffset);

The in and out buffers are reusable, making it possible to stem many words
without generating any garbage at all. Of course, this scheme is only possible
if all stems are always shorter than some known value, like 64 chars.
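
For anyone who wants to try the buffer-reuse idea, here is a self-contained
sketch. BufferSketchStemmer is a hypothetical stand-in with the proposed
stem(in, inOffset, inLength, out, outOffset) signature; its only rule (drop a
trailing "ly") is a placeholder, and the 64-char limit is the assumed bound
mentioned above. The StringBuilder is there only to display the results and
is not part of the scheme.

```java
// Hypothetical stand-in for the proposed API; real stemming rules are omitted.
class BufferSketchStemmer {
    // Copies in[inOffset..inOffset+inLength) to out starting at outOffset,
    // dropping a trailing "ly", and returns the number of chars written.
    int stem(char[] in, int inOffset, int inLength, char[] out, int outOffset) {
        int outLength = inLength;
        if (inLength > 4
                && in[inOffset + inLength - 2] == 'l'
                && in[inOffset + inLength - 1] == 'y') {
            outLength = inLength - 2;
        }
        if (outLength > out.length - outOffset) {
            throw new IllegalArgumentException("output buffer too small");
        }
        System.arraycopy(in, inOffset, out, outOffset, outLength);
        return outLength;
    }
}

class BufferReuseDemo {
    static final int MAX_WORD = 64; // assumed upper bound on word length

    // Stems every word through the same two buffers and joins the results
    // with spaces.
    static String stemAll(String[] words) {
        BufferSketchStemmer stemmer = new BufferSketchStemmer();
        char[] in = new char[MAX_WORD];   // allocated once, reused per word
        char[] out = new char[MAX_WORD];  // allocated once, reused per word
        StringBuilder joined = new StringBuilder(); // display only
        for (String word : words) {
            int inLength = word.length();
            if (inLength > MAX_WORD) continue; // skip words that do not fit
            word.getChars(0, inLength, in, 0);
            int outLength = stemmer.stem(in, 0, inLength, out, 0);
            if (joined.length() > 0) joined.append(' ');
            joined.append(out, 0, outLength);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(stemAll(new String[] {"quickly", "hello", "slowly"}));
    }
}
```

Inside the loop, the stemming path itself allocates nothing: the same in and
out arrays serve every word, and the caller decides what to do with words
longer than the fixed bound.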