[Snowball-discuss] Re: Possible memory leak in Snowballs Java stemmer

Martin Porter martin.porter@grapeshot.co.uk
Mon May 17 09:51:19 2004


Wolfram,

Thank you for your carefully researched email. The Java generator is the
work of Richard Boulton and we must refer it to him. I hope we'll get an
answer back to you shortly. (I know he is rather busy at the moment.)

I am posting this to snowball-discuss@lists.tartarus.org. Please post
subsequent replies to that address.

Martin


At 16:43 14/05/2004 +0200, Wolfram Esser wrote:
>Hello Mr. Porter!
>
>I really appreciate your work in the field of stemming. I'm using 
>the German stemmer extensively to find typing mistakes in large 
>electronic encyclopedias.
>
>I am using the Java stemming engine which is provided by this link:
>    http://snowball.tartarus.org/snowball_java.tgz
>This is, to my knowledge, the current version of the Java stemmer.
>
>_*Problem:*_
>When stemming about 500,000 words and generating a Java HashMap which 
>maps all the stems to their corresponding words, I get OutOfMemory 
>exceptions, even with about 700MB of Java heap space and about 1GB 
>of machine RAM. This is strange, because the raw data needed must be 
>something like 6MB+6MB+small(X), or about 20-30 MB of RAM.
>
>_*Analysis:*_
>According to your Java TestApp (delivered with the above archive), after 
>calling stem() one has to use SnowBallProgram.getCurrent() to get the 
>stem of the stemmed word. This method does the following:
>    public String getCurrent()
>    {
>        return current.toString();
>    }
>
>So it converts the StringBuffer current to a String, but 
>StringBuffer.toString() does a so-called "lazy copy": it does NOT 
>create a fresh new String which is returned, but instead it creates a 
>"hollow" String object, whose data buffer points to the existing 
>StringBuffer. So the StringBuffer and the new String point to the exact 
>same memory block.
>
>So when the StringBuffer has allocated 2MB (with only 10 bytes used, which 
>is OK for a StringBuffer!), the new String also points to a 2MB memory 
>block with only 10 bytes used.
>Java memorizes this fact, and when the StringBuffer changes its value, 
>the actual copy of the memory is done, but too late! The String 
>object occupies 2MB, and always will, even if only 10 bytes contain 
>useful characters!
>
>As people call SnowBall's getCurrent() method often, they almost 
>always get String objects that occupy a lot of useless memory. This is 
>O.K. if they only use these Strings, for example (like in your 
>TestApp), to do a System.out.println() and discard them afterwards. Then 
>the memory will be freed by Java's garbage collector. But when you keep 
>references to those Strings (e.g. as keys in a HashMap, like in my 
>case!), the machine's memory runs out lightning fast! Actually I could only 
>store about 300,000 stems in my HashMap, which occupied 600MB of RAM at 
>that time!
>
>So, reusing StringBuffers is actually a use case which maybe was not 
>intended by the developers of the StringBuffer class.
>
>
>_*Solution:*_
>
>Either the user or your library can do something like this:
>    String myStem = new String( germanStemmer.getCurrent());
>
>
>or (which I would prefer): rewrite the getCurrent method like one of the 
>following (this prevents library users from using the library in a maybe 
>dangerous way):
>
>    public String getCurrent()
>    {
>        return new String(current);
>    }
>
>or
>    public String getCurrent()
>    {
>        return current.substring(0);
>    }
>
>In both cases only the actual amount of occupied characters is stored in 
>the new String object.
>
>
>
>I don't know who is actually maintaining the Java part of Snowball. But 
>I'm sure you can forward this email to him/her.
>I really would like to hear from your team whether you could reproduce my 
>problem and whether you find the solution helpful.
>Or did I overlook some other (memory-saving) means of getting the 
>desired stem?
>
>Anyway: Thank you for your great work
>and greetings from Germany:
>            Wolfram
>
>
>
>-- 
>
>---
>
>     o    Wolfram Esser (Dipl.-Inform.), Lehrstuhl fuer Informatik II
>    / \       Universitaet Wuerzburg, Am Hubland, D-97074 Wuerzburg
>infoII o      Phone: +49 (0)931-888-6614   Fax: +49 (0)931-888-6603
>  / \         mailto:esser@informatik.uni-wuerzburg.de
> o   o        http://www2.informatik.uni-wuerzburg.de/staff/wolfram/
>
>
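
For readers of the archive, the workaround described in the quoted message
might be sketched roughly as below. This is only an illustration under
stated assumptions, not the generated Snowball code: the class StemCopyDemo
and its setCurrent() helper are hypothetical stand-ins for a stemmer that
reuses one StringBuffer, and no real stemming is performed. The sketch
combines the substring(0) variant of getCurrent() with the caller-side
new String(...) copy before the stem is stored as a HashMap key. As far as
I know, the buffer sharing described above applies to older JVMs; on more
recent JDKs toString() should already copy only the used characters, in
which case the extra copies are harmless but redundant.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a stemmer that reuses a single StringBuffer.
public class StemCopyDemo {
    private final StringBuffer current = new StringBuffer();

    // Hypothetical helper: load a word into the reused buffer.
    public void setCurrent(String word) {
        current.setLength(0);
        current.append(word);
    }

    // Variant from the quoted message: substring(0) builds a String that
    // holds only the characters currently in use.
    public String getCurrent() {
        return current.substring(0);
    }

    public static void main(String[] args) {
        StemCopyDemo stemmer = new StemCopyDemo();
        Map<String, String> stemToWord = new HashMap<String, String>();
        String[] words = {"Haus", "Hauses", "Baum"};
        for (String word : words) {
            stemmer.setCurrent(word);
            // A real stemmer would call stem() here; this sketch skips it.
            // Caller-side copy, as suggested in the quoted message.
            String stem = new String(stemmer.getCurrent());
            stemToWord.put(stem, word);
        }
        System.out.println(stemToWord.size());
    }
}
```

Either copy on its own should be enough; both are shown only so that the
library-side and the caller-side variants can be compared.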