[Snowball-discuss] Optimising among

Martin Porter martin.porter at grapeshot.co.uk
Mon Sep 18 12:54:42 BST 2006


Olly,

This is very helpful, and many thanks. We will certainly incorparate this
extension.

I had a few points:

0) In general, we must be extremely careful now not to introduce any bugs
during codegenerator enhancement. There is nothing more demoralising than
compiler bugs, and now that we are beginning to receive Snowball submissions
from 3rd parties we must be most careful to avoid them. Interestingly,
although the earlier submission had a bug in the "z->c == z->l" test
(missing the possibility of a null string in the among(...), the more
elaborate 2nd submission is bug free -- as far as I can tell. But we must do
all the testing we can. (I have a test suite anyway, so does Richard.)

1) A pity in some ways that Richard suggested special code for fewer than 3
strings in an among, since that must be rare surely? No strings is
forbidden. For 1 string, among('a' C) is the same as 'a' C, and so not
useful. For 2 strings, among('a1' (C1) 'a2' (C2)) might happen, but I
suspect is rare, and in any case one doubts if the double test on a
character is faster than the bitmap approach (see point 4).

2) Perhaps ~I3 (at one point in the code) should be ~I3L, since we are up to
32 bits, or is that pedantic?

3) The case that doesn't work so well is a string-forward among(...) in utf8
mode when the strings being tested have leading characters > 128. Then the
first byte of the utf8 characters will often be the same, and the
optimisation doesn't pull out any exceptions (utf-8 cyrillic for example).
It did occur to me that perhaps only the string-backward case is worth
optimising.

4) I think we should dispense with the 'block' idea. The bitmap test is a
way if discarding unwanted cases, and if a match is found with the bitmap on
a character that does not begin an among string, the among function then
finds it can be discarded. This double test will happen only rarely, and you
save one test in the optimisation, as well as generating the optimisation
for all amongs, rather than the subset where leading characters of strings
are in the same block.

Martin





More information about the Snowball-discuss mailing list