[Snowball-discuss] a simple algorithm problem
James Aylett
james at tartarus.org
Thu Jan 6 11:18:36 GMT 2005
On Thu, Jan 06, 2005 at 10:20:43AM +0000, Martin Porter wrote:
> >Presumably this still restricts Snowball to code points in the BMP? Or
> >does it just restrict it to recognising and doing things with
> >characters at code points in the BMP, passing through any others?
>
> It would be the latter. Since stemming is applicable to a system of
> languages, all of whose characters are, I would assert, in the BMP, I do
> not think that is a problem.
I'd agree that, right now, acting on code points outside the BMP
shouldn't be needed. Providing other characters are passed through, it
will cope happily with anything I can reasonably think of throwing at
it :-)
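To make sure we mean the same thing by "passed through", here is a
rough Python sketch of the behaviour I have in mind. It is not Snowball
code and says nothing about how the compiler or runtime actually work;
the names BMP_MAX and transform_bmp_only are made up for illustration,
and str.lower just stands in for whatever a stemmer would really do to
a character.

```python
# Illustration only: act on code points inside the BMP (U+0000..U+FFFF)
# and copy anything outside it to the output untouched.
BMP_MAX = 0xFFFF  # highest code point in the Basic Multilingual Plane


def transform_bmp_only(text, act):
    """Apply `act` to each BMP character; pass other characters through."""
    out = []
    for ch in text:
        if ord(ch) <= BMP_MAX:
            out.append(act(ch))   # character we recognise and may change
        else:
            out.append(ch)        # outside the BMP: left exactly as it was
    return "".join(out)


# U+00C9 is in the BMP; U+10348 (a Gothic letter) is not.
sample = "Caf\u00c9 \U00010348"
result = transform_bmp_only(sample, str.lower)
```

If the output is then encoded as UTF-8, the non-BMP character comes out
as the same four bytes that went in.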
> >What's the character encoding of snowball scripts at the moment?
> The scripts themselves are in ASCII, and ASCII assumptions are made in the
> Snowball compiler.
So when you were talking about strings being in UTF-8, were you
talking about input and output only? I wasn't aware the concept of
'string' applied to anything other than things in the Snowball
language itself ... or do you mean that strings would be stored
internally to a running snowball stemmer in UTF-8?
Cheers,
James
--
/--------------------------------------------------------------------------\
James Aylett xapian.org
james at tartarus.org uncertaintydivision.org