[Snowball-discuss] Pure-Python implementation of the Snowball stemmers

Martin Porter martin.f.porter at gmail.com
Fri Aug 8 11:36:49 BST 2014


Florian,

I have downloaded your system and had a quick look. In the comments
below remember I don't know python, so may be making errors ...

I strongly suspect the slow performance is because of the way you do
an among. You order the strings longest to shortest, and do a linear
scan, using s.starts_with(..) in turn to see if the ending matches one
of the strings. The stemmers spend most of their time processing these
amongs, and optimising the among code is critical to get fast
performance.

The algorithm for finding the longest matching substring in an among
in the  C source code is in the coding of find_among and find-among_b
of utlilities.c, and is well worth studying: for n strings you do
O(log n) (i.e. "order log n") character comparisons (not string
comparisons). The java code copies it, It does however require careful
replication of the tables which drive it all. Even if you don't go
that far, a switch of some sort will give a much better result.

The amongs usually deliver no match, so an initial test to bypass the
among will also speed things up greatly. For example, with your list,

((u'alize', '', 0), (u'icate', '', 1), (u'iciti', '', 1), (u'ative',
'', 2), (u'ical', '', 1), (u'ness', '', 2), (u'ful', '', 2),)

you can't have a match unless the last letter is e,i,l or s, and
the last but one letter is a,s,t,u,v or z. This can be established by
table lookup.

Martin



More information about the Snowball-discuss mailing list