[Snowball-discuss] snowball and combining characters

Olly Betts olly at survex.com
Fri Mar 30 21:38:09 BST 2012


On Fri, Mar 30, 2012 at 06:19:02PM +0100, Jason Spashett wrote:
> What is the situation as regards combining characters vs pre-composed in  
> unicode?
>
> For example:
>
> HEBREW LETTER BET WITH DAGESH
>
> pre-composed FB30
> using combiners 05D0 05BC
>
> Does Snowball recognise these as the same character? I assume not. It is  
> also fair to say that Snowball will count these differently.
> 1 'slot' in the pre-composed case and 2 slots in the combiner case?

Snowball has no specific knowledge of combining characters, so it sees
the combining sequence as two characters.  That's true of the compiler
and support code at least - you could write code in the Snowball
language which allowed for this (e.g. by standardising the input on
one form before doing anything else), but none of the existing stemmers
do so, as far as I am aware.
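
To illustrate the "slots" point, here is a minimal sketch in Python (not
Snowball code; the standard unicodedata module is used only to inspect the
strings, and the helper name describe is made up for this example).  It
compares the two code point sequences from the question.  As a side note,
the names reported for U+FB30 and U+05D0 are ALEF-based; BET WITH DAGESH
is U+FB31, or 05D1 05BC with combiners.

```python
import unicodedata

precomposed = "\uFB30"        # the single code point given in the question
combining = "\u05D0\u05BC"    # base letter followed by a combining point

def describe(s):
    """Return (code point, Unicode name) pairs for each character in s."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in s]

# Without normalisation the two spellings are simply different strings,
# occupying one and two code points respectively.
same_string = precomposed == combining
lengths = (len(precomposed), len(combining))

for label, s in (("precomposed", precomposed), ("combining", combining)):
    print(label, describe(s))
```

Anything that compares or counts characters without normalising first,
which is what the reply above describes for Snowball, will treat the two
as unrelated.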

> If this is so, then I assume that the way to proceed might be to convert  
> any combiner representation into the pre-composed form before using  
> Snowball?

Yes, that's probably simplest - there are a number of existing libraries
which can do this.  You want to be a bit careful about exactly which
normalisation you do, though - e.g. for Latin scripts you wouldn't
want to convert the letters ffi to the U+FB03 typographic ligature;
ideally you'd want to do the reverse, in fact.  NFKC is probably the
appropriate normalised form to use.
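
As a rough sketch of that pre-processing step, assuming Python and its
standard unicodedata module (the function names and the stem parameter
are hypothetical, standing in for whichever Snowball binding is in use):

```python
import unicodedata

def normalize_for_stemming(text):
    """Apply NFKC so equivalent spellings become one code point sequence."""
    return unicodedata.normalize("NFKC", text)

def stem_normalized(word, stem):
    """Normalise first, then hand the word to a stemmer.

    `stem` is a hypothetical callable (str -> str), not a real Snowball API.
    """
    return stem(normalize_for_stemming(word))

# NFKC expands the U+FB03 ligature back into the plain letters f, f, i.
ligature_expanded = normalize_for_stemming("o\uFB03ce")

# Both spellings from the question end up as the same sequence.
forms_agree = (normalize_for_stemming("\uFB30")
               == normalize_for_stemming("\u05D0\u05BC"))
```

One thing to be aware of: U+FB30 and the other Hebrew presentation forms
are excluded from composition, so NFKC (like NFC) maps both spellings to
the two code point sequence 05D0 05BC rather than to the pre-composed
form.  The input is still consistent afterwards, which is what matters
here, but a stemmer would need to expect the decomposed spelling.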

Cheers,
    Olly


