[Snowball-discuss] Doubt about the portuguese stem

Grant Ingersoll gsingers at apache.org
Thu Aug 20 14:39:08 BST 2009


I did the following in Lucene, using the Snowball Portuguese stemmer:

    PortugueseStemmer stemmer = new PortugueseStemmer();
    stemmer.setCurrent("então");
    stemmer.stem();
    System.out.println("Stem: " + stemmer.getCurrent());

The output is:
Stem: entã

I can't speak for Sphinx, but in Lucene/Solr (http://lucene.apache.org/solr)
the common approach is to strip accents, either before or after stemming,
so that entã becomes enta. Users who type the correct accents and users
who leave them out then both get matches. In short, the problem isn't in
Snowball: it is either in your Sphinx setup or in Sphinx itself, if it
doesn't let you strip accents.
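
To make "strip accents" concrete, here is a rough sketch using only the
JDK. It is not the Lucene filter or anything Sphinx provides; the class
and method names are made up for illustration. It decomposes each
character (Unicode NFD) and drops the combining marks, and it assumes
the same folding is applied to both the indexed text and the query.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only (not a Lucene or Sphinx API).
final class AccentFolding {

    // Decompose accented characters into base letter + combining marks (NFD),
    // then remove the combining marks (Unicode category M).
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("entã"));    // enta
        System.out.println(stripAccents("Então"));   // Entao
        System.out.println(stripAccents("começar")); // comecar
    }
}
```

Whether this runs before or after the stemmer, the point is that the
indexed token and the query token end up in the same accent-free form.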

Hope this helps,
Grant

On Aug 19, 2009, at 4:31 PM, Leonardo Borges wrote:

> Hello guys,
>
> I am currently evaluating Sphinx as an option for my projects and,
> since I am Brazilian, wanted to try the Portuguese stemmer you guys
> provide.
>
> Thus, I compiled Sphinx with the libstemmer option and everything
> went great.
>
> Given the following phrase, in one of my documents: "Então, vamos  
> começar a usar libstemmer"
>
> The following searches return the correct document:
> "Então", "Entã", "então", "entã"
> which is great, but if I search for:
> "Entao"
> It returns nothing.
>
> Since I didn't dig into the algorithm, is this the expected
> behavior? If so, is removing the accents myself the way to accomplish
> what I'm trying to do? Or perhaps you guys have other suggestions?
>
> Thanks a lot,
> Leonardo Borges
> www.leonardoborges.com

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search
