[Snowball-discuss] More patches

Olly Betts olly at survex.com
Sun Feb 11 22:56:11 GMT 2007


I'm currently updating Xapian to use UTF-8 stemmers generated by the
latest version of snowball.  I've patched the snowball compiler to
generate the stemmers as C++ classes, and I'm embedding the patched
compiler in the Xapian build system, so Xapian users can easily drop
in new stemmers.

For ease of forward maintenance I'd like to merge changes back into
the official snowball tree where possible, so I've unpicked the changes
into patches with a single purpose each, the first batch of which are
described and linked to below.

I'm not expecting you'll want to apply all the changes I've made,
but I'm offering them all for completeness.  They'll also serve as
documentation of what is different.

Let me know if you've any questions!

OK, here are the patches:

Fix a typo of a function name in a comment:

http://oligarchy.co.uk/xapian/patches/snowball-add_to_b-comment-correction.patch

This improves the shortcutting of backwards among - if there are fewer
characters available than the shortest string in the among, there's
no way it can match.  It also includes a cosmetic tweak (avoiding
generating "z->c - 0" in the output) which makes the generated source
a little more readable (of course the C compiler will optimise the "- 0"
away anyway):

http://oligarchy.co.uk/xapian/patches/snowball-min-length-shortcut-backwards-among.patch
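
As a rough illustration of the idea only (this is not the code from the
patch), the shortcut can be sketched in C as below.  The struct and
function names are made up for the example, and the real among tables
are searched differently; the point is just the early return when fewer
characters remain before the cursor than the shortest candidate needs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the stemmer state: p is the
 * buffer, c the cursor, lb the backward limit (the cursor may not move
 * below lb). */
struct env_sketch {
    const char *p;
    int c;
    int lb;
};

/* Returns 1 if one of the candidate strings ends at the cursor, else 0.
 * min_len is the length of the shortest candidate. */
static int among_b_sketch(const struct env_sketch *z,
                          const char *const *strings, int n, int min_len)
{
    int i;
    assert(min_len >= 0);
    /* Shortcut: fewer characters are available than the shortest
     * candidate needs, so nothing can match. */
    if (z->c - z->lb < min_len) return 0;
    for (i = 0; i < n; i++) {
        int len = (int)strlen(strings[i]);
        if (z->c - z->lb >= len &&
            memcmp(z->p + z->c - len, strings[i], (size_t)len) == 0)
            return 1;
    }
    return 0;
}
```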

ISO C doesn't require that pointers to different types have the same
representation, except that void* and pointers to character types must
have the same representation, as must pointers to qualified and
unqualified versions of compatible types.

So, for example, if sizeof(int) is 4, it's possible that int* pointers
could contain the same bit pattern as a void* pointer to the same
object, but shifted right by 2 bits.

Therefore it's not portable to just cast the type of a function pointer
if the argument types are pointers with potentially different
representations, as the arguments won't get converted.  I'm not sure if
any current platforms actually use different pointer representations, so
the portability improvement may be a somewhat theoretical one, but it's
easy enough to make the code standard conforming:

http://oligarchy.co.uk/xapian/patches/snowball-sort-function-casting.patch
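
To make the distinction concrete, here is a minimal sketch of the
general pattern (not the contents of the patch; the struct, typedef and
function names are invented for the example).  The conforming version
declares the callback with void pointer parameters and converts them
inside the function, where the compiler can apply any representation
change that is needed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record type, standing in for whatever is being sorted. */
struct entry_sketch {
    const char *s;
};

/* Typed comparison: convenient to write, but its parameter types are
 * not those of a generic "compare two void pointers" callback. */
static int compare_entries(const struct entry_sketch *a,
                           const struct entry_sketch *b)
{
    return strcmp(a->s, b->s);
}

/* Generic callback type of the kind a sort routine would accept. */
typedef int (*generic_cmp)(const void *, const void *);

/* Not portable:
 *     generic_cmp f = (generic_cmp)compare_entries;
 * The caller would pass void pointers, and nothing would convert them
 * to struct pointers before compare_entries() used them. */

/* Conforming: take void pointers and convert them inside the function. */
static int compare_entries_generic(const void *a, const void *b)
{
    const struct entry_sketch *ea = a;
    const struct entry_sketch *eb = b;
    assert(ea != NULL && eb != NULL);
    return compare_entries(ea, eb);
}
```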

This patch changes snowball to use the ISO C qsort() function instead
of the custom sort routine in sort.c.  My motivation is to keep
the number of source files down, but it seems very similar in
performance (cachegrind suggests sort.c is a few cycles faster on
x86-64 linux, but the difference is so small you couldn't measure it
by normal means) so maybe it's worth considering the change for the
mainstream snowball compiler too:

http://oligarchy.co.uk/xapian/patches/snowball-use-qsort.patch
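
For anyone who hasn't used it, a call to qsort() looks roughly like the
sketch below.  The entry type and function names are hypothetical and
not taken from the snowball source.  ISO C doesn't guarantee that
qsort() is stable, so the sketch assumes the relative order of equal
strings doesn't matter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry type: a string plus some associated value. */
struct among_entry_sketch {
    const char *s;
    int result;
};

/* Comparison callback with exactly the signature qsort() expects. */
static int compare_among_sketch(const void *a, const void *b)
{
    const struct among_entry_sketch *ea = a;
    const struct among_entry_sketch *eb = b;
    return strcmp(ea->s, eb->s);
}

/* Sort n entries in place into ascending string order. */
static void sort_among_sketch(struct among_entry_sketch *v, size_t n)
{
    assert(v != NULL);
    qsort(v, n, sizeof *v, compare_among_sketch);
}
```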

This patch disables Java support, which we don't use, so dropping it
saves a source file.  It's really just included for completeness, though
perhaps it would be useful to apply with a "#define JAVA_SUPPORT" added,
to make it easy for others to disable the Java support if they don't
need it:

http://oligarchy.co.uk/xapian/patches/snowball-disable-java-support.patch
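
The "#define JAVA_SUPPORT" suggestion would amount to something like
the following sketch.  Only the macro name comes from the suggestion
above; the enum and function are invented for illustration and don't
reflect how the snowball driver actually selects its output language.

```c
#include <assert.h>
#include <string.h>

/* Comment this out to build without the Java generator. */
#define JAVA_SUPPORT

enum lang_sketch { LANG_SKETCH_C, LANG_SKETCH_JAVA, LANG_SKETCH_UNKNOWN };

/* Map an output language name to an enum value; "java" is only
 * recognised when JAVA_SUPPORT is defined. */
static enum lang_sketch parse_lang_sketch(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "c") == 0) return LANG_SKETCH_C;
#ifdef JAVA_SUPPORT
    if (strcmp(name, "java") == 0) return LANG_SKETCH_JAVA;
#endif
    return LANG_SKETCH_UNKNOWN;
}
```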

Cheers,
    Olly


