[Snowball-discuss] More patches

Martin Porter martin.porter at grapeshot.co.uk
Fri Feb 16 13:02:25 GMT 2007


> I wonder if the algorithms should perform lowercasing for you.  

No, I think that would be a mistake. The accompanying documentation
(e.g. Introduction section 4) assumes throughout that normalisation into
lower case has taken place. As the stemmers stand, upper case can be
used as a way of bypassing stemming. Lowercasing is only part of the
normalisation that must be applied to extract stemmable items in the
indexing process, which includes (for the English stemmer) mapping
variant forms of apostrophe into the ASCII apostrophe character. On the
whole the stemmers are defined without relation to any character set,
and the full lowercasing issue would get caught up in Unicode.
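
To make the division of labour concrete, here is a minimal sketch in
Python of the kind of normalisation the caller might apply before
handing a token to a stemmer. The names normalise, normalise_and_stem
and APOSTROPHE_VARIANTS are hypothetical, the set of apostrophe variants
is an assumption chosen for illustration, and stem stands for whatever
stemming function you are using; none of this is part of Snowball.

```python
# Assumed set of apostrophe variants (illustrative, not exhaustive):
# right single quote, left single quote, single high-reversed-9 quote.
APOSTROPHE_VARIANTS = "\u2019\u2018\u201b"

# Translation table mapping each variant to the ASCII apostrophe (U+0027).
_APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalise(token: str) -> str:
    """Map apostrophe variants to ASCII, then lowercase the token."""
    # str.lower() uses Python's Unicode-aware case mapping, so the
    # character-set question is settled here, outside the stemmer.
    return token.translate(_APOSTROPHE_TABLE).lower()


def normalise_and_stem(token: str, stem) -> str:
    """Normalise a token, then apply a caller-supplied stemming function."""
    # `stem` is a placeholder for any callable taking and returning a str.
    return stem(normalise(token))
```

For example, normalise("Don\u2019t") gives "don't". A caller that wants
a token to bypass stemming can simply skip normalise_and_stem for it.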



