[Snowball-discuss] Doubt about the portuguese stem

Leonardo Borges leonardoborges.rj at gmail.com
Thu Aug 20 14:43:22 BST 2009


Grant,
Thanks for the feedback.

Yeah, I just wanted to confirm that this is how the algorithm is supposed to
work. I'll have to find a way to strip accents using Sphinx.
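
For anyone doing this step outside Sphinx, here is a minimal Java sketch of
accent stripping. It is not Sphinx configuration and not part of Snowball; it
only assumes the standard java.text.Normalizer class, and the names
AccentStripper and stripAccents are made up for illustration.

```java
import java.text.Normalizer;

public class AccentStripper {
    // Hypothetical helper (not from Snowball, Lucene, or Sphinx):
    // decompose each character into base letter + combining marks (NFD),
    // then delete the combining marks.
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source compiles under any file encoding.
        System.out.println(stripAccents("ent\u00e3o"));     // "então"
        System.out.println(stripAccents("ent\u00e3"));      // "entã", the stem from Grant's example
        System.out.println(stripAccents("come\u00e7ar"));   // "começar"
    }
}
```

The same function would have to run on both the indexed text and the search
terms for the accented and unaccented forms to match.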

Regards,
Leonardo Borges
www.leonardoborges.com

On Thu, Aug 20, 2009 at 3:39 PM, Grant Ingersoll <gsingers at apache.org> wrote:

> I did the following in Lucene, using the Snowball Portuguese stemmer:
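>     // import org.tartarus.snowball.ext.PortugueseStemmer;
>     PortugueseStemmer stemmer = new PortugueseStemmer();
>     stemmer.setCurrent("então");   // word to stem
>     stemmer.stem();                // stems the current word in place
>     System.out.println("Stem: " + stemmer.getCurrent());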
>
> The output is:
> Stem: entã
>
> I can't speak for Sphinx, but in Lucene/Solr
> (http://lucene.apache.org/solr) the common approach is to strip accents
> (either before or after stemming), so that entã becomes enta. Users who
> type the correct accents and users who omit them then both get matches.
> In short, the problem isn't in Snowball; it is either in your Sphinx setup
> or in Sphinx itself, if it doesn't allow you to strip accents.
>
> Hope this helps,
> Grant
>
> On Aug 19, 2009, at 4:31 PM, Leonardo Borges wrote:
>
> Hello guys,
>
> I am currently evaluating Sphinx as an option for my projects and, since I
> am Brazilian, wanted to try the Portuguese stemmer you guys provide.
>
> So I compiled Sphinx with the libstemmer option and everything went
> great.
>
> Given the following phrase, in one of my documents: "Então, vamos começar a
> usar libstemmer"
>
> The following searches return the correct document:
> "Então", "Entã", "então", "entã"
> which is great, but if I search for:
> "Entao"
> It returns nothing.
>
> I haven't dug into the algorithm, so is this the expected behavior? If so,
> is removing the accents myself the way to accomplish what I'm trying to do?
> Or perhaps you guys have other suggestions?
>
> Thanks a lot,
> Leonardo Borges
> www.leonardoborges.com
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/