[Snowball-discuss] a simple algorithm problem

James Aylett james at tartarus.org
Thu Jan 6 11:18:36 GMT 2005


On Thu, Jan 06, 2005 at 10:20:43AM +0000, Martin Porter wrote:

> >Presumably this still restricts Snowball to code points in the BMP? Or
> >does it just restrict it to recognising and doing things with
> >characters at code points in the BMP, passing through any others?
> 
> It would be the latter. Since stemming is applicable to a system of
> languages, all of whose characters are, I would assert, in the BMP, I do
> not think that is a problem.

I'd agree that, right now, acting on code points outside the BMP
shouldn't be needed. Providing other characters are passed through, it
will cope happily with anything I can reasonably think of throwing at
it :-)
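
As a rough illustration of the pass-through behaviour described above,
here is a minimal Python sketch. It is not Snowball's actual code: the
names `is_bmp` and `act_on_bmp` are hypothetical, and the per-character
callback simply stands in for whatever the stemmer would do.

```python
def is_bmp(ch: str) -> bool:
    """True if the character's code point lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


def act_on_bmp(text: str, act) -> str:
    """Apply `act` to each BMP character; copy every other character unchanged.

    `act` is a hypothetical stand-in for the stemmer's own processing.
    """
    return "".join(act(ch) if is_bmp(ch) else ch for ch in text)


if __name__ == "__main__":
    # U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so it is copied as-is.
    sample = "a\U0001D11Eb"
    print(act_on_bmp(sample, lambda ch: "_"))
```

Only the two BMP characters are handed to the callback here; the
supplementary-plane character, which takes four bytes in UTF-8, comes out
the other side untouched.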

> >What's the character encoding of snowball scripts at the moment?
> The scripts themselves are in ASCII, and ASCII assumptions are made in the
> Snowball compiler. 

So when you were talking about strings being in UTF-8, were you
talking about input and output only? I wasn't aware the concept of
'string' applied to anything other than things in the Snowball
language itself ... or do you mean that strings would be stored
internally to a running snowball stemmer in UTF-8?

Cheers,
James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james at tartarus.org                               uncertaintydivision.org


