[Snowball-discuss] More patches

Tolkin, Steve Steve.Tolkin at FMR.COM
Fri Feb 16 13:05:18 GMT 2007


As you point out, some code will already change its input to lower case.  Therefore lowercasing should not be a standard part of the stemmer, primarily for performance reasons.  It would be "nice to have" if Snowball provided a lowercasing feature that could be optionally invoked.
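
To make the "optionally invoked" idea concrete, here is a minimal sketch in Python.  The names `stem_word` and `toy_stem` and the `lowercase` flag are hypothetical and are not part of Snowball's API; the toy stemmer only strips one trailing "s" so that the example stays self-contained.

```python
from typing import Callable


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'.

    Like the real stemmers, it only recognises lowercase input.
    """
    if len(word) > 2 and word.endswith("s"):
        return word[:-1]
    return word


def stem_word(
    word: str,
    stemmer: Callable[[str], str] = toy_stem,
    lowercase: bool = False,
) -> str:
    """Stem `word`, lowercasing first only when the caller asks for it.

    Callers whose input is already lowercase leave `lowercase` off and
    pay nothing for the extra pass.
    """
    if lowercase:
        word = word.lower()
    return stemmer(word)
```

With the flag off, `stem_word("CATS")` comes back unchanged because the toy stemmer never sees a lowercase "s"; with `lowercase=True` the same call yields "cat".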

Hopefully helpfully yours, 
Steve 
-- 
Steve Tolkin    Steve . Tolkin at FMR dot COM   508-787-9006
Fidelity Investments   82 Devonshire St. M3L     Boston MA 02109 
There is nothing so practical as a good theory.  Comments are by me, 
not Fidelity Investments, its subsidiaries or affiliates. 

-----Original Message-----
From: snowball-discuss-bounces at lists.tartarus.org [mailto:snowball-discuss-bounces at lists.tartarus.org] On Behalf Of Olly Betts
Sent: Friday, February 16, 2007 7:06 AM
To: Richard Boulton
Cc: snowball-discuss at lists.tartarus.org
Subject: Re: [Snowball-discuss] More patches

[some snipped]

I wonder if the algorithms should perform lowercasing for you.  In
general it's a required preprocessing step for the stemmers to work
correctly, so most users will need to implement the lowercasing for
themselves (except perhaps for applications where the input is always
lowercase already).

The problem I can see is that to do it correctly for all non-ASCII
characters requires fairly large tables, and doing it just for ASCII
letters probably isn't really sufficient.  Perhaps it's only necessary
for characters the stemmers check for though.  Thoughts?
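
The three options in the paragraph above can be compared in a short Python sketch.  The helper names `ascii_lower` and `small_lower`, and the letter set `FRENCH_LETTERS`, are assumptions made for illustration; the set is not taken from any actual Snowball stemmer.

```python
def ascii_lower(text: str) -> str:
    """Lowercase only A-Z; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def small_lower(text: str, letters: str) -> str:
    """Lowercase only characters whose lowercase form appears in `letters`.

    The table has one entry per letter the stemmer is assumed to check
    for, instead of covering all of Unicode.
    """
    table = {}
    for c in letters:
        upper = c.upper()
        # Skip letters with no distinct single-character uppercase form.
        if len(upper) == 1 and upper != c:
            table[ord(upper)] = c
    return text.translate(table)


# Illustrative letter set only: a-z plus a few accented letters
# (a-grave, a-circumflex, c-cedilla, e-grave, e-acute, e-circumflex,
# e-diaeresis, i-circumflex, i-diaeresis, o-circumflex, u-grave,
# u-circumflex), written as escapes to avoid encoding ambiguity.
FRENCH_LETTERS = (
    "abcdefghijklmnopqrstuvwxyz"
    "\u00e0\u00e2\u00e7\u00e8\u00e9\u00ea\u00eb\u00ee\u00ef\u00f4\u00f9\u00fb"
)

word = "\u00c9T\u00c9"  # uppercase form of the French word for "summer"
ascii_only = ascii_lower(word)                # accented capitals survive
full_unicode = word.lower()                   # uses the runtime's full tables
per_stemmer = small_lower(word, FRENCH_LETTERS)
```

Here the ASCII-only version leaves the accented capitals in place, while the small per-stemmer table gives the same result as full Unicode lowercasing for this word, at the cost of maintaining one letter list per language.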

Cheers,
    Olly

_______________________________________________
Snowball-discuss mailing list
Snowball-discuss at lists.tartarus.org
http://lists.tartarus.org/mailman/listinfo/snowball-discuss



