[Snowball-discuss] Spanish stemmer with accents stripped before stemming
    Andrew Green 
    ndrw_grn at yahoo.com.mx
       
    Wed May 23 00:09:24 BST 2007
    
    
  
Hi, Martin,
Thank you very much for your replies...
> Obviously to us, it is a bit easier to look at the problem
> from the snowball angle, rather than think about the generated java
> after it's been put inside lucene! As far as the snowball script is
> concerned, I believe you could strip out accents from the source,
> eliminate the duplicate strings in the amongs(..) that would result, and
> recompile, getting the effect you want.
OK... We just tried that, and it works very well so far! Thanks a bunch.
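Why duplicates appear once the accents are removed can be illustrated with a minimal sketch. It assumes that only the accented vowels and ü are folded (ñ is left alone), and the suffix strings are an illustrative sample, not copied from the actual Spanish script. The names `strip_accents` and `dedupe_stripped` are hypothetical helpers; the real change is made by editing the snowball source and recompiling.

```python
import unicodedata

# Assumption: fold only accented vowels and ü; ñ is a distinct letter and is kept.
ACCENT_MAP = str.maketrans("áéíóúüÁÉÍÓÚÜ", "aeiouuAEIOUU")


def strip_accents(text: str) -> str:
    """Return text with Spanish accented vowels replaced by plain vowels."""
    # Compose first so that 'a' + combining acute is also matched by the map.
    return unicodedata.normalize("NFC", text).translate(ACCENT_MAP)


def dedupe_stripped(suffixes):
    """Strip accents from each string and drop duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for suffix in suffixes:
        plain = strip_accents(suffix)
        if plain not in seen:
            seen.add(plain)
            result.append(plain)
    return result


# Illustrative sample only (hypothetical, not the stemmer's real among list).
sample = ["ará", "ara", "arán", "aran", "ación"]
print(dedupe_stripped(sample))  # ['ara', 'aran', 'acion']
```

Pairs such as "ará" and "ara" collapse to the same string, which is why the duplicates have to be removed from each among(..) before recompiling.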
> (Incidentally, I have hit this problem with Spanish stemming before, but
> it was a long while ago -- before the development of snowball.)
Accents are the boogeyman of day-to-day written Spanish usage and it's
hard to imagine an effective search engine that obliges users to type
them correctly.
> Thinking about it further, this will not work, since the strings are
> placed in tables which would need to be fully reorganised if any of
> the characters in the strings were readjusted. (It is the way a
> snowball 'among' is implemented.)
> 
> The only way to do this is to modify the stem.sbl file for Spanish,
> regenerate the java code with the snowball compiler (which you can
> download) and replace the old java with the new in your application.
Ah... So now we know why our previous attempt failed.
It occurs to me that perhaps it would be a good idea to modify
Snowball's Spanish stemmer to accept both accented and accent-stripped
input.
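One way to picture accepting both forms is to fold accents on every word before it reaches a stemmer built from the accent-stripped script, at index time and at query time alike. The following is only a sketch under that assumption: `fold`, `stem_folded` and `toy_stem` are hypothetical names, and `toy_stem` is a stand-in that is not the Snowball algorithm.

```python
import unicodedata

# Assumption: fold only accented vowels and ü; ñ is kept as a separate letter.
_ACCENTS = str.maketrans("áéíóúüÁÉÍÓÚÜ", "aeiouuAEIOUU")


def fold(word: str) -> str:
    """Lowercase a word and replace accented vowels with plain ones."""
    return unicodedata.normalize("NFC", word).translate(_ACCENTS).lower()


def stem_folded(word: str, stem) -> str:
    """Fold accents first, then pass the word to the supplied stemmer."""
    return stem(fold(word))


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: drops a trailing 'es' or 's'."""
    for ending in ("es", "s"):
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)]
    return word


# Accented and accent-stripped spellings reach the stemmer as the same string.
print(stem_folded("canción", toy_stem) == stem_folded("cancion", toy_stem))  # True
```

Because the folding happens before stemming, the stemmer itself only ever sees one spelling of each word.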
Greetings,
Andrew Green
P.S. Our little server was down during the weekend, sorry. It's back
online again, though a commit error to the repository has now made the
relevant files difficult to access. They seem less relevant now,
I suppose.