[Snowball-discuss] sb_symbol

Richard Boulton richard at lemurconsulting.com
Wed Jul 8 15:09:45 BST 2009


2009/7/8 Nemeskey Dávid <nemeskey.david at sztaki.hu>:
> I am David Nemeskey, and I've just subscribed to the list. Our company
> uses the snowball stemmer in a search engine. I have two questions about
> the library; I hope it's the right list.
>
> Firstly, we use the Porter stemmer for English. However, it sometimes
> gives a poor result, such as "stemming" bus to bu + PL. Is it possible
> to use a dictionary with any of the Snowball stemmers to avoid this
> problem?

Use the "english" stemmer (also called "porter2") to avoid this
problem. It's an improved version of "porter" and, in particular, it
stems this example better (bus -> bus instead of bus -> bu).

See http://snowball.tartarus.org/algorithms/english/stemmer.html for
more information.

> Secondly, I have recently compiled the newest version to include the
> "other" stemmers. When integrating it with our codebase, I realized
> that sb_symbol had changed from char to unsigned char (I know, it's an
> old change, but up till now we used an older version).

The change was made in 2005 (r348 of the SVN repo):

r348 | richard | 2005-08-24 10:40:38 +0100 (Wed, 24 Aug 2005) | 4 lines

Change sb_symbol to "unsigned char" to match the internal "symbol" type.
This fixes a warning reported by GCC 4.0, so compilation completes with
no warnings.


I can't remember any details beyond that, I'm afraid. The
internal "symbol" type referred to here is defined in runtime/api.h.
I'm not sure whether changing it to "char" instead of "unsigned char"
would break anything (but I wouldn't be surprised if it did).
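
If your codebase keeps text in plain "char" buffers, casting at the call
boundary is probably safer than changing the type inside the library. The
following is only a rough sketch of that idea, not code from Snowball:
my_symbol is a hypothetical stand-in for sb_symbol, and as_symbols and
count_high_bytes are made-up helper names used for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for libstemmer's sb_symbol, which has been
 * "unsigned char" since r348. This is not the real header. */
typedef unsigned char my_symbol;

/* View a caller-owned char buffer as symbols without copying.
 * Reading any object through an unsigned char pointer is permitted in C. */
static const my_symbol *as_symbols(const char *text)
{
    return (const my_symbol *)text;
}

/* Count bytes with the high bit set (>= 0x80), e.g. the non-ASCII bytes
 * of ISO-8859-1 or UTF-8 text. With an unsigned type the comparison is
 * direct; where plain char is signed, such bytes would be negative. */
static size_t count_high_bytes(const my_symbol *p, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (p[i] >= 0x80)
            n++;
    }
    return n;
}
```

The point of the sketch is that byte values above 0x7F behave predictably
as "unsigned char" (always 128..255), whereas with plain "char" the result
depends on whether the platform treats char as signed. Existing char data
can be passed through a cast like the one in as_symbols without copying.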

-- 
Richard


