[Snowball-discuss] Snowball 2.0.0 released

Olly Betts olly at survex.com
Fri Oct 4 22:48:07 BST 2019


On Fri, Oct 04, 2019 at 11:45:02AM +0200, Yann Barsamian wrote:
> I had a look at the changes, that's great work! You wrote in the "NEWS"
> file that you made some updates of the Java code to match the C code.
> Regarding the correspondence between the two codes, I still have one
> question:
> 
> * the basic function that has to be used in Snowball is the "stem" function.
> In the Java code, it is always the "boolean stem()" method --- whose
> implementation depends on the class --- and for the c code, it is the "int
> XXXlanguage_XXXencoding_stem(struct SN_env * z)" function.
> 
> * the stem() methods seem all to always return true, while the corresponding
> C functions seem to be able to return values different from 1 (e.g., if the
> prelude returns a value < 0). I was wondering why there is this difference ?

In the generated C, the return value conveys one of two things:

* If it's >= 0, it indicates the Snowball signal (0 for `f`, otherwise `t`)
  that caused us to exit the Snowball external function that was called.

* If it's < 0 that indicates a runtime issue (such as failure to
  allocate memory or an out of bounds cursor value).

For Java and most (perhaps all) of the other language generators the
second case is instead handled by throwing an exception, so the return
value just gives you the Snowball signal (and so naturally the return
type becomes a purely boolean one).
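
As a rough illustration of that mapping, here is a minimal Java sketch.
The class `StemStatus`, the method `toSignal` and the choice of
`IllegalStateException` are all hypothetical and only for illustration;
they are not part of Snowball, and the exception types the generated
code really throws may differ.

```java
// Hypothetical helper, not part of Snowball: it mirrors the convention
// described above for the int returned by the generated C code.
final class StemStatus {
    private StemStatus() {}

    // Translate a C-style status into the boolean-plus-exception form.
    static boolean toSignal(int status) {
        if (status < 0) {
            // Runtime issue, e.g. allocation failure or a bad cursor value.
            // The exception type here is an assumption.
            throw new IllegalStateException("runtime error, status " + status);
        }
        // 0 is the Snowball signal f; any other non-negative value is t.
        return status != 0;
    }
}
```

With this sketch, `StemStatus.toSignal(0)` gives `false`, any positive
status gives `true`, and a negative status throws.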

For the stemmers we currently ship, the return value has no particular
meaning, but these stemmers can actually return `false` in some cases in
their current implementations:

$ for f in java/org/tartarus/snowball/ext/*Stemmer.java ; do sed '/^public boolean stem() /,/^}/ s/return/&/p;d' "$f"|grep -q 'return false' && echo "$f"; done
java/org/tartarus/snowball/ext/greekStemmer.java
java/org/tartarus/snowball/ext/hindiStemmer.java
java/org/tartarus/snowball/ext/indonesianStemmer.java
java/org/tartarus/snowball/ext/russianStemmer.java
java/org/tartarus/snowball/ext/tamilStemmer.java
java/org/tartarus/snowball/ext/turkishStemmer.java

> In the test Java programs, stem() is called without checking the return
> value, so if things are working like that, it is maybe better to make it a
> void method and clearly indicate a difference in functionality from the C
> code?

While we don't currently make use of this return value, in the wider
context the signal from a Snowball program is potentially interesting,
and it conceivably could be for a stemmer as it could indicate something
about the stem - e.g. for a stemmer which can return multiple stems
(e.g. noun stem vs verb stem) it could indicate if there's more than one
stem available.

Making the return type `void` would also complicate code generation - the
Snowball language allows an external function to be called from Snowball,
so to handle that case we'd need two versions of the external - one like
the current one for internal use, and a wrapper with a `void` return type
to expose externally.  We would only need to use this mechanism when an
external is used internally (which none of the current stemmers do), but
the mechanism would still need to be implemented and maintained.  Or else
we change the language to disallow calling externals, which I'm hesitant
to do without a more compelling reason.
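
To make the two-version idea concrete, here is a hypothetical sketch of
what such generated Java might look like.  The class name, the method
names and the stand-in method bodies are invented for illustration, and
passing the word as a parameter is a simplification; this is not what
the Snowball compiler emits.

```java
// Hypothetical shape of generated code if externals were exposed as void.
// All names and bodies are illustrative, not Snowball output.
final class WrappedStemmerSketch {

    // Internal version: keeps the boolean signal so that other Snowball
    // routines can still call the external and test its result.
    private boolean stemInternal(String word) {
        // Stand-in for the generated body: signal f for empty input.
        return !word.isEmpty();
    }

    // Public wrapper: discards the signal to present a void API.
    void stem(String word) {
        stemInternal(word);
    }

    // A routine that calls the external internally and depends on its
    // signal, which is the case that forces the internal version to exist.
    boolean callsExternal(String word) {
        return stemInternal(word) && word.length() > 2;
    }
}
```

The wrapper itself is trivial, but the generator would have to know when
to emit the internal version, when to emit the wrapper, and which of the
two each call site should use.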

So overall I don't think it makes sense to change the return type to
`void`.

Cheers,
    Olly


