[Snowball-discuss] Spanish stemmer with accents stripped before stemming

Sat May 19 06:18:49 BST 2007

Hello!

We're trying to fine-tune an application we're building that searches
mainly Spanish content.  We've been working with Lucene and the included
Spanish snowball stemmer.

We are removing accents before stemming because misplaced-accent errors
are extremely common--we cannot expect users to enter correctly accented
terms in their queries.
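
As a rough illustration of what "removing accents before stemming" means
here, the sketch below strips diacritics from a term using the standard
java.text.Normalizer class. It is only a sketch under stated assumptions,
not the analyzer code linked further down: the class name AccentStripper
and the method stripAccents are made up for this example. Note that this
approach also turns "ñ" into "n", which may or may not be what you want
for Spanish.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of Lucene or Snowball.
final class AccentStripper {
    private AccentStripper() {}

    // Decompose each character into base letter + combining marks (NFD),
    // then drop the combining marks. "canción" becomes "cancion".
    // Caveat: "ñ" decomposes to "n" + combining tilde, so it becomes "n".
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes avoid depending on the source file encoding.
        System.out.println(stripAccents("canci\u00f3n")); // cancion
        System.out.println(stripAccents("libros"));       // libros (unchanged)
    }
}
```

The same function would be applied both to indexed text and to query
terms, so that "cancion" and "canción" reach the stemmer as the same
string.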

Our approach so far has been to simply replace accented characters in
the Spanish stemmer Java code with unaccented ones. This has produced a
stemmer that works--to a degree--but some simple terms are no longer
stemmed correctly. Specifically, regular plural masculine nouns
("libros" or "datos") are not stemmed at all.

Still, after reading the explanation of the Spanish algorithm [1] I
can't see why this would be the case--or why the removal of accented
characters from both the code and stemmer input would affect the
algorithm's effectiveness at all.

You can see our code here [2] and here [3].

Any suggestions?

Thanks in advance,
Andrew Green

[1] http://snowball.tartarus.org/algorithms/spanish/stemmer.html
[2] http://200.67.231.185/svn/pescador/trunk/java_source/net/sf/snowball/ext/Spanish2Stemmer.java
[3] http://200.67.231.185/svn/pescador/trunk/java_source/org/apache/lucene/analysis/snowball/SnowballAnalyzerWithoutAccents.java