[Snowball-discuss] modifying Java English stemmer to accept new exceptions

Martin Porter martin.porter at grapeshot.co.uk
Sat Jun 24 10:42:15 BST 2006


Brian,

I can't comment on the Java code, since Richard Boulton wrote the
code generator for it and handles its support (I don't know if Richard cares
to comment?), but I would advise modifying the Snowball scripts and
generating the Java afresh. In fact, you pretty well have to do that,
since the 'among's compile into optimized table structures that you can't
really modify by hand (add an extra item and the whole table changes...).
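
To see why hand-editing is impractical, here is a much simplified sketch in
Python of a sorted lookup table in which each entry stores the index of its
longest prefix that is also an entry. It only illustrates the general idea and
is not the layout the Snowball code generator emits; build_table and the
prefix link are hypothetical names, and the sample words are irregular forms
from the table further down.

```python
def build_table(strings):
    """Return sorted (string, prefix_link) pairs.

    prefix_link is the index of the longest other entry that is a proper
    prefix of this entry, or -1 if no entry is a prefix of it.
    (Hypothetical structure, for illustration only.)
    """
    ordered = sorted(set(strings))
    index = {s: i for i, s in enumerate(ordered)}
    table = []
    for s in ordered:
        link = -1
        # Try the longest proper prefix first, then shorter ones.
        for n in range(len(s) - 1, 0, -1):
            if s[:n] in index:
                link = index[s[:n]]
                break
        table.append((s, link))
    return table


before = build_table(["beat", "beaten", "bit", "bitten", "went"])
after = build_table(["beat", "beaten", "bit", "bitten", "went", "ate"])

for row in before:
    print("before", row)
for row in after:
    print("after ", row)
```

Adding 'ate' shifts every index and rewrites the links stored for 'beaten' and
'bitten', which is the kind of knock-on change that makes regenerating from
the script the safer route.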

I did at one time collect all the main irregularities of English verbs, with
a view to doing what you are now attempting, although I never put a new
stemmer based on it into service. You have to be very careful: the past of
'see' is 'saw', but a saw is also a cutting tool, and so on. So you might
find the table below, which I put together many years ago, useful. You can
find these tables in dictionaries and grammars, but they are rarely
complete, and are often cluttered with archaic forms that are no longer useful.

Martin

 
Base   Past    PP        Verb list
----------------------------------------------------------------
SMELL  SMELT         (r) burn learn spell smell spill spoil dwell(a)
BEND   BENT              bend build lend send spend rend(a) gird(a)
HIT    HIT           (*) bet burst cast cost(r) cut hit hurt let put
                         quit rid set shed shut slit split spread
                         thrust upset wet(r)
SEW    SEWED   SEWN      sew sow show hew(a) mow(r) saw(r) strew(r)
                         shave(r)
BEAT   BEAT    BEATEN    beat
DRINK  DRANK   DRUNK     begin drink ring shrink sing sink spring stink
                         swim
WIN    WON               cling dig fling sling spin stick sting string
                         swing win wring slink
SIT    SAT               sit spit
BLEED  BLED              bleed breed feed lead meet read speed
GET    GOT               get
HANG   HUNG              hang
FIND   FOUND             bind find grind wind
LIGHT  LIT               light slide
SHINE  SHONE             shine
FIGHT  FOUGHT            fight
STRIKE STRUCK            strike
HOLD   HELD              hold
SHOOT  SHOT              shoot
COME   CAME    COME      come become
RUN    RAN     RUN       run
KEEP   KEPT              creep keep leap sweep sleep weep
SELL   SOLD              sell tell
FLEE   FLED              flee
HEAR   HEARD             hear
SAY    SAID              say
SHOE   SHOD              shoe
MEAN   MEANT             deal dream feel kneel lean mean
BUY    BOUGHT            buy
LEAVE  LEFT              leave bereave(a)
LOSE   LOST              lose
RIDE   RODE    RIDDEN    drive ride rise arise strive write smite(a)
STRIDE STRODE  -         stride
FLY    FLEW    FLOWN     fly
STEAL  STOLE   STOLEN    freeze speak steal weave
BREAK  BROKE   BROKEN    break wake awake
FORGET FORGOT  FORGOTTEN forget tread
BEAR   BORE    BORNE     bear tear swear wear
LIE    LAY     LAIN      lie
BITE   BIT     BITTEN    bite hide
CHOOSE CHOSE   CHOSEN    choose
SEE    SAW     SEEN      see
EAT    ATE     EATEN     eat
FORBID FORBADE FORBIDDEN forbid forgive give bid(a)
TAKE   TOOK    TAKEN     forsake(a) shake take
FALL   FELL    FALLEN    fall
DRAW   DREW    DRAWN     draw
GROW   GREW    GROWN     blow grow know throw
SLAY   SLEW    SLAIN     slay(a)
SWELL  SWELLED SWOLLEN   swell(r)
SHEAR  SHEARED SHORN     shear(r)
MAKE   MADE              make
BRING  BROUGHT           bring think
TEACH  TAUGHT            teach beseech(a) seek(a)
CATCH  CAUGHT            catch
STAND  STOOD             stand understand
GO     WENT    GONE      go
DO     DID     DONE      do

Verbs marked (r) also have regular forms. Verbs marked (a) are archaic. Verbs
marked (*) are irregular, but not in a way that causes difficulties to a
stemming algorithm.

The past participle (pp) of `hang' is `hanged' or `hung', depending on the
sense. `lie' is irregular when it means `lying down', regular when it means
`telling falsehoods'. `stride' has no pp in normal use.
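
One way such a table might be put to use is as an exception lookup that runs
before the ordinary stemmer. The Python sketch below is only an illustration
under assumptions: normalise and its fallback parameter are hypothetical
names, the handful of entries is copied from the paradigm rows above, and
'saw' is left out on purpose because it is also a noun.

```python
# Irregular form -> base form, copied from a few paradigm rows of the table.
IRREGULAR_FORMS = {
    "held": "hold",
    "went": "go", "gone": "go",
    "ate": "eat", "eaten": "eat",
    "took": "take", "taken": "take",
    "flew": "fly", "flown": "fly",
    "seen": "see",
}

# Forms deliberately excluded because they are also ordinary words.
AMBIGUOUS_FORMS = {"saw"}  # past of 'see', but also a cutting tool


def normalise(word, fallback=lambda w: w):
    """Map a known irregular form to its base; otherwise defer to fallback.

    fallback stands in for whatever regular stemmer is in use (assumption);
    by default it returns the lower-cased word unchanged.
    """
    key = word.lower()
    if key in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[key]
    return fallback(key)


print(normalise("went"))
print(normalise("saw"))
```

Keeping the ambiguous forms in a separate set records why they were skipped,
so they are not added back by accident later.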

We are left with 135 verbs with irregularities in the past or pp forms:

 arise awake bear beat become begin bend bind bite bleed blow
 break breed bring build burn buy catch choose cling come creep
 deal dig do draw dream drink drive eat fall feed feel fight find
 flee fling fly forbid forget forgive freeze get give go grind
 grow hang hear hide hold keep kneel know lead lean leap learn
 leave lend lie light lose make mean meet mow read ride ring rise
 run saw say see sell send sew shake shave shear shine shoe shoot
 show shrink sing sink sit sleep slide sling slink smell sow
 speak speed spell spend spill spin spit spoil spring stand steal
 stick sting stink strew stride strike string strive swear sweep
 swell swim swing take teach tear tell think throw tread
 understand wake wear weave weep win wind wring write

plus these 20 invariant forms:

 bet burst cast cost cut hit hurt let put quit rid set shed shut
 slit split spread thrust upset wet

and these 11 archaic forms, which may or may not be worth including:

 bereave beseech bid dwell forsake gird hew rend seek slay smite
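
The three lists can be held as sets, which makes it easy to confirm the counts
given above and to check which group a verb falls in. This Python sketch is
only an illustration; classify and include_archaic are hypothetical names, and
"unlisted" simply means the word is in none of the three lists.

```python
IRREGULAR = set("""
 arise awake bear beat become begin bend bind bite bleed blow
 break breed bring build burn buy catch choose cling come creep
 deal dig do draw dream drink drive eat fall feed feel fight find
 flee fling fly forbid forget forgive freeze get give go grind
 grow hang hear hide hold keep kneel know lead lean leap learn
 leave lend lie light lose make mean meet mow read ride ring rise
 run saw say see sell send sew shake shave shear shine shoe shoot
 show shrink sing sink sit sleep slide sling slink smell sow
 speak speed spell spend spill spin spit spoil spring stand steal
 stick sting stink strew stride strike string strive swear sweep
 swell swim swing take teach tear tell think throw tread
 understand wake wear weave weep win wind wring write
""".split())

INVARIANT = set("""
 bet burst cast cost cut hit hurt let put quit rid set shed shut
 slit split spread thrust upset wet
""".split())

ARCHAIC = set("""
 bereave beseech bid dwell forsake gird hew rend seek slay smite
""".split())


def classify(verb, include_archaic=False):
    """Say which of the lists a verb belongs to, if any."""
    v = verb.lower()
    if v in IRREGULAR:
        return "irregular"
    if v in INVARIANT:
        return "invariant"
    if include_archaic and v in ARCHAIC:
        return "archaic"
    return "unlisted"


print(len(IRREGULAR), len(INVARIANT), len(ARCHAIC))
print(classify("swim"), classify("cut"), classify("smite", include_archaic=True))
```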





