[Snowball-discuss] Doubt about the portuguese stem
Grant Ingersoll
gsingers at apache.org
Thu Aug 20 14:39:08 BST 2009
I did the following in Lucene, using the Snowball Portuguese stemmer:
import org.tartarus.snowball.ext.PortugueseStemmer;

PortugueseStemmer stemmer = new PortugueseStemmer();
stemmer.setCurrent("então");
stemmer.stem();
System.out.println("Stem: " + stemmer.getCurrent());
The output is:
Stem: entã
I can't speak for Sphinx, but in Lucene/Solr (http://lucene.apache.org/solr)
a common step (either before or after stemming) is to strip accents, so
that entã becomes enta. Users who know the correct accents and users who
don't then both still get matches. In short, the problem isn't in
Snowball; it is either in your setup of Sphinx or in Sphinx itself, in
that it doesn't allow you to strip accents.
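As a rough illustration of what I mean by stripping accents, here is a
minimal sketch using only the JDK's java.text.Normalizer. It is not what
Lucene or Sphinx do internally; the class and method names (AccentStripper,
stripAccents) are made up for this example. The idea is to decompose each
accented character into a base letter plus combining marks, then drop the
marks. You would apply the same function to both the indexed text and the
query.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of Snowball, Lucene or Sphinx.
final class AccentStripper {

    // Decompose to NFD (base letter + combining marks), then remove the marks.
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // "ent\u00e3" is the stem from above: e, n, t, a-with-tilde (U+00E3).
        System.out.println(stripAccents("ent\u00e3")); // prints "enta"
    }
}
```

This only handles characters that decompose into a base letter plus a
combining mark, which covers the Portuguese accents and the cedilla; letters
such as ø or ß would need an explicit mapping.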
Hope this helps,
Grant
On Aug 19, 2009, at 4:31 PM, Leonardo Borges wrote:
> Hello guys,
>
> I am currently evaluating Sphinx as an option for my projects and,
> since I am Brazilian, wanted to try the Portuguese stemmer you guys
> provide.
>
> So I compiled Sphinx with the libstemmer option and everything
> went great.
>
> Given the following phrase, in one of my documents: "Então, vamos
> começar a usar libstemmer"
>
> The following searches return the correct document:
> "Então", "Entã", "então", "entã"
> which is great, but if I search for:
> "Entao"
> It returns nothing.
>
> Since I didn't dig into the algorithm, is this the expected
> behavior? If so, is the way to accomplish what I'm trying to do to
> remove the accents myself? Or perhaps you guys have other suggestions?
>
> Thanks a lot,
> Leonardo Borges
> www.leonardoborges.com
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search