[Snowball-discuss] New, and a couple of questions

Sandy Ganz sganz@bizrate.com
Tue Mar 9 19:05:02 2004


Hello, I just got the Snowball stemmers working and have a couple of
questions: one about memory allocation in the libraries, and one about
finding out what was done to the stem candidate.

The first question is about how the Snowball libraries handle memory
allocation when a word is stemmed. I have followed the code a bit, but I am
unsure whether it works the way I think it does. This is what I think happens
(and what would be nice): when a word is stemmed, the internal allocators
allocate a string, starting from the SN_create_env params that say to begin
with a zero-sized buffer for 'p', and then expand it as necessary. The memory
allocated for 'p' will not be realloc'ed unless it needs to grow. Is this the
case? The main reason I ask is that allocations are extremely costly with the
amount of data I am currently running through the older C state-machine
Porter stemmer, which is very fast. Can I call SN_create_env with a large
starting buffer to keep the allocators from being called? The memory is only
freed when SN_close_env() is called.
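
To make the behaviour I am hoping for concrete, here is a minimal sketch of a
grow-only buffer. It is not taken from the Snowball sources; the names
grow_buf, grow_buf_set and grow_buf_free, the initial capacity of 16 and the
doubling policy are all assumptions made for illustration. The point is only
that storage is reallocated when a word is longer than anything seen so far,
and is otherwise reused until it is explicitly freed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical grow-only buffer; NOT the Snowball implementation. */
struct grow_buf {
    char         *data;     /* heap storage, NULL until first use        */
    size_t        len;      /* bytes currently in use                    */
    size_t        cap;      /* bytes currently allocated                 */
    unsigned long reallocs; /* number of times storage was (re)allocated */
};

/* Copy n bytes of s into the buffer. Storage is reallocated only when
   n exceeds the current capacity. Returns 0 on success, or -1 if the
   allocation fails (the buffer is left unchanged in that case). */
static int grow_buf_set(struct grow_buf *b, const char *s, size_t n)
{
    assert(b != NULL && (s != NULL || n == 0));

    if (n > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16; /* assumed initial size */
        while (new_cap < n) {
            if (new_cap > SIZE_MAX / 2) {      /* avoid overflow on doubling */
                new_cap = n;
                break;
            }
            new_cap *= 2;
        }
        char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
        b->reallocs++;
    }
    if (n > 0)
        memcpy(b->data, s, n);
    b->len = n;
    return 0;
}

/* Release the storage; the buffer can be reused afterwards. */
static void grow_buf_free(struct grow_buf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}
```

With a zero-initialised struct grow_buf, setting "running" and then "cat"
costs one allocation in total, because the second word fits in the storage
obtained for the first. Only a word longer than the current capacity triggers
another realloc.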

Second question: after a stem candidate is passed to the stemmer, is there a
fast way to know whether the word was modified, i.e., stemmed? Comparing
lengths would work in some cases, but not where the stem has the same length
as the original word. I would hate to have to strcmp() each word after
stemming to see if it was stemmed (which I need to know).
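
For reference, the fallback I would like to avoid looks roughly like the
sketch below. The function name stem_changed is hypothetical and nothing here
calls into Snowball; it assumes the caller already has both strings and their
lengths. It checks the lengths first, so bytes are only compared when the
lengths match.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the stem differs from the original
   word, 0 if they are identical. Lengths are passed in so that no
   strlen() call is needed and the cheap length test can run first. */
static int stem_changed(const char *word, size_t word_len,
                        const char *stem, size_t stem_len)
{
    if (word_len != stem_len)
        return 1;                      /* different length: changed   */
    return memcmp(word, stem, word_len) != 0; /* same length: compare */
}
```

A pair such as "running" and "run" is settled by the length test alone, while
a same-length pair such as "happy" and "happi" still needs the byte
comparison, which is the cost I am asking about.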

I am currently running about 20 gig of data, and the bulk of its text data
touches the stemmer, so anything that keeps the speed up will be most
appreciated.

Thanks!

Sandy