[Snowball-discuss] Multiple errors in generated Java sources for Latin algorithm

Olly Betts olly at survex.com
Sun Jun 11 08:03:35 BST 2017


On Tue, Jun 06, 2017 at 05:40:07PM +0300, Alexander Myltsev wrote:
> On 4 June 2017 at 11:51:14, Martin Porter wrote:
> > The Latin stemmer was included in the snowball release rather as a 
> > theoretical exercise. I just wanted to show the ease of putting the 
> > Schinke rules into a snowball form. The unusual feature of producing 
> > two stems does seem rather unsatisfactory, if one is concerned with 
> > practical IR work. 
> 
> Do you think “two stems” would be fixed anytime soon?

Schinke's algorithm is defined as producing two stems, one for the noun
form and one for the verb form.  So this is a feature not a bug and
can't really be "fixed", at least not while still being an
implementation of Schinke's algorithm.

> On 4 June 2017 at 01:11:57, Olly Betts wrote:
> > The java backend attempts to avoid writing out unreachable code, because 
> > the designers of Java decided that unreachable code should be a 
> > compile-time error. While that may make sense for human-written code, 
> > it's unhelpful when generating code, but that's the situation we have to 
> > work with. 
> > 
> > There's a bug with this currently, as the end of this function clearly 
> > can be reached. If I disable the elision of unreachable code, the 
> > generated latinStemmer.java has a "return true;" at the end of that 
> > function (and that's the only difference). 
> 
> OK. I wonder what part of latin.sbl causes invalid Java code to be
> produced? Since all the other algorithms work (I hope in Java too),
> maybe there is a workaround to make “latin” work in Java until it is
> fixed?

I've pushed a fix for this.
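
For anyone following along, here is a rough sketch of the kind of
reachability tracking a generator needs when targeting Java.  It is not
the actual Snowball backend code: the class and method names
(ReachabilityEmitter, statement, ret, structural, joinPoint) are made up
for illustration.  The idea is to drop statements that follow an
unconditional return, and to mark code as reachable again at any point a
"break" can jump to.  Forgetting that second step gives exactly this
sort of bug, where a needed final "return true;" goes missing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; not taken from the Snowball sources.
public class ReachabilityEmitter {
    private final List<String> lines = new ArrayList<>();
    // True while the next emitted statement could still be executed.
    private boolean reachable = true;

    // Ordinary statement: skipped if javac would reject it as unreachable.
    public void statement(String code) {
        if (reachable) lines.add(code);
    }

    // Unconditional return: whatever follows is unreachable until a join point.
    public void ret(String expr) {
        if (reachable) {
            lines.add("return " + expr + ";");
            reachable = false;
        }
    }

    // Structural text (labels, braces) is always written out.
    public void structural(String code) {
        lines.add(code);
    }

    // Call where control can arrive from elsewhere, e.g. just after a
    // labelled block that contains a "break" to that label.
    public void joinPoint() {
        reachable = true;
    }

    public List<String> lines() {
        return lines;
    }

    public static void main(String[] args) {
        ReachabilityEmitter e = new ReachabilityEmitter();
        e.structural("lab0: {");
        e.statement("if (cond) break lab0;");
        e.ret("false");
        e.statement("cursor = 0;"); // elided: follows an unconditional return
        e.structural("}");
        e.joinPoint();              // the break above lands here
        e.ret("true");              // without joinPoint() this would be dropped
        for (String line : e.lines()) {
            System.out.println(line);
        }
    }
}
```

If the joinPoint() call is left out, the final return is silently
elided and the generated method no longer compiles, since javac sees a
path that falls off the end of a non-void method.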

On Wed, Jun 07, 2017 at 12:46:56AM +0300, Alexander Myltsev wrote:
> Also, I wonder why Java stemmer produces incorrect results? As I
> wrote, stem for “datum” is “dat” (as libstemmer_c produces) but not
> “datum” (as Java stemmer produces). My guess is that there is a bug on
> line 311 of latinStemmer.java. Could you comment on that?

That's a separate problem, which remains after the fix above.  I think
it's due to a known bug with $ applied to a string variable, which we
noticed recently while working on merging the new Go backend, and which
I think affects all the backends except the C one.  The problematic
construct doesn't feature in any of the included stemmers, but the Latin
stemmer uses it.

I don't see an easy way to change the stemmer code to avoid this, and
really it makes more sense to direct effort towards fixing the bug with
$ applied to a string.  Backends should all support the language
correctly.

Cheers,
    Olly


