[Snowball-discuss] problems with Finnish

Martin Porter martin_porter@softhome.net
Sat Sep 21 09:52:02 2002


Alex, (and Richard),

I realise from your email what the problem with Finnish is. I added a
feature to Snowball fairly late on: in an 'among', the test for a string
can be supplemented by a function call, which becomes part of the test:

    'u' not_after_q
    'us' preceded_by_vowel

- that sort of thing. I was using it for work with the so-called Lovins
stemmer, and it is not used with any of the stemmers on general release -
apart from Finnish. The Finnish stemmer was done very recently. To get it
working without using this new feature would have been really quite
difficult, such are the complexities of the Finnish language.

(More recently I've got an interpreted version of Snowball working, which
is not part of the general release of Snowball, and in this case the
supplementary functions proved rather difficult to implement, so much so
that I was rather regretting ever having introduced them. But I got it
working ...)

Richard Boulton's generated Java code does not at the moment implement these
supplementary function calls, since they went into Snowball after he had
written the Java code generator. I had quite forgotten this when releasing
the Finnish stemmer. So apologies, Alex, for the time you've spent
discovering that.

We'll have to wait for a reply from Richard to see whether he's prepared to
do more work here, but meanwhile we must remember that Finnish stemming is
hardly in great demand! (I might have a go at adding it in, but it means
getting into Java again.) 

----

But [this is to Richard] it is not too hard to implement. The routine that
interprets the 'among' structure contains a call back into the generated
code, corresponding to a call of the supplementary function. You just need
to add this to the code which you hand-translated into Java, and you told
me that was done very easily.
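
To make the idea concrete, here is a rough sketch in Java of an 'among'
routine with the call back added. It is not Richard's code and not the
Snowball runtime: the class AmongSketch, the method findAmongB, the result
numbers, and the bodies of notAfterQ and precededByVowel are all invented
for illustration, and the entries are simply scanned longest first rather
than searched the way the real routine does it.

```java
import java.util.function.BooleanSupplier;

public class AmongSketch {
    /** One entry of an 'among': a string plus an optional supplementary test. */
    static final class Among {
        final String s;
        final int result;
        final BooleanSupplier condition; // null when the entry has no function

        Among(String s, int result, BooleanSupplier condition) {
            this.s = s;
            this.result = result;
            this.condition = condition;
        }
    }

    private final String text;
    private int cursor; // backward matching starts at the end of the word

    AmongSketch(String text) {
        this.text = text;
        this.cursor = text.length();
    }

    // Hypothetical supplementary tests; they inspect the character just
    // before the cursor, i.e. just before the string that was matched.
    boolean notAfterQ() {
        return cursor == 0 || text.charAt(cursor - 1) != 'q';
    }

    boolean precededByVowel() {
        return cursor > 0 && "aeiou".indexOf(text.charAt(cursor - 1)) >= 0;
    }

    /**
     * Backward 'among'. Entries are assumed to be ordered longest first.
     * Returns the result of the first entry whose string matches and whose
     * supplementary test (if any) succeeds, or 0 when nothing matches.
     */
    int findAmongB(Among[] entries) {
        final int saved = cursor;
        for (Among a : entries) {
            int start = saved - a.s.length();
            // startsWith returns false for a negative offset
            if (!text.startsWith(a.s, start)) continue;
            cursor = start;                  // cursor sits before the match
            boolean ok = a.condition == null || a.condition.getAsBoolean();
            cursor = start;                  // undo any movement by the test
            if (ok) return a.result;
            cursor = saved;                  // test failed: try the next entry
        }
        return 0;
    }

    /** Runs the two-entry example 'among' against the end of a word. */
    static int match(String word) {
        AmongSketch m = new AmongSketch(word);
        Among[] entries = {
            new Among("us", 2, m::precededByVowel),
            new Among("u", 1, m::notAfterQ),
        };
        return m.findAmongB(entries);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"pious", "bonus", "menu", "qu"}) {
            System.out.println(w + " -> " + match(w));
        }
    }
}
```

The point is only the few lines around a.condition: when the string
matches, the routine calls back into the generated code, and a failed call
is treated the same as a failed string match.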

Regarding Russian, the Java and C systems have been tested, and match, so
the issue must be the character set. Are you using Unicode without 'symbol'
set to two bytes?

Martin