[Snowball-discuss] Stop word lists

Martin Porter martin_porter@softhome.net
Tue Oct 8 11:35:02 2002


Alex,


I have now looked at the stopword lists you sent yesterday, and they
have increased my confidence in the quality of the Snowball ones. I have
looked at the English one very carefully, and can report on the
findings.

If x is the Snowball stopword list for English, and y is the English
stopword list you sent me, we can look at the various sets x, y, x-y,
y-x, x or y, x and y. Their sizes are as follows:

    | x |       = 119
    | y |       =  76
    | x-y |     =  59
    | y-x |     =  16
    | x or y |  = 135
    | x and y | =  60

and the sets themselves are,

x = { a about above after again against all am an and any are as at
be because been before being below between both but by did do does
doing down during each few for from further had has have having he
her here hers herself him himself his how i if in into is it its
itself me more most my myself no nor not of off on once only or other
our ours ourselves out over own same she so some such than that the
their theirs them themselves then there these they this those through
to too under until up very was we were what when where which while
who whom why with you your yours yourself yourselves }

y = { a about after all also an and any are as at be because been but
by can co corp could for from had has have he her his if in inc into
is it its last more most mr mrs ms mz no not of on one only or other
out over s says she so some such than that the their there they this
to up was we were when which who will with would }

x-y = { above again against am before being below between both did do
does doing down during each few further having here hers herself him
himself how i itself me my myself nor off once our ours ourselves own
same theirs them themselves then these those through too under until
very what where while whom why you your yours yourself yourselves }

y-x = { also can co corp could inc last mr mrs ms mz one s says will
would }
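
The figures above can be rechecked mechanically. What follows is only a
minimal sketch in Python, not part of Snowball: the word lists are copied
from this message, and the variable names are illustrative assumptions.

```python
# Word lists copied from this message; variable names are illustrative only.
x = set("""
a about above after again against all am an and any are as at
be because been before being below between both but by did do does
doing down during each few for from further had has have having he
her here hers herself him himself his how i if in into is it its
itself me more most my myself no nor not of off on once only or other
our ours ourselves out over own same she so some such than that the
their theirs them themselves then there these they this those through
to too under until up very was we were what when where which while
who whom why with you your yours yourself yourselves
""".split())

y = set("""
a about after all also an and any are as at be because been but
by can co corp could for from had has have he her his if in inc into
is it its last more most mr mrs ms mz no not of on one only or other
out over s says she so some such than that the their there they this
to up was we were when which who will with would
""".split())

# The six sets discussed in the message, using Python's set operators.
sets = {
    "x": x,
    "y": y,
    "x-y": x - y,        # in the Snowball list only
    "y-x": y - x,        # in the other list only
    "x or y": x | y,     # union
    "x and y": x & y,    # intersection
}

for label, members in sets.items():
    print(f"| {label} | = {len(members)}")

print("y-x =", " ".join(sorted(y - x)))
```

Holding the lists as sets means a word repeated by accident in either
list would not distort the counts.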

As you can see, x is substantially larger than y, and the terms in x-y
are plausible stopwords. But if you take the 16 terms in y-x, 6 are
mentioned in the comments in the source of x, and so could always be
picked up by users working from the source:

    auxiliaries: can could will would
    common words: also says

7 are components of names of people and organisations and should only
be treated as stopwords in rather special circumstances:

    co corp inc mr mrs ms mz

which leaves

    s one last.
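
The breakdown of the 16 terms in y-x can be checked the same way. Again
this is just a sketch: the group names are labels chosen for
illustration, and the words are those given in this message.

```python
# The 16 terms in y-x, as listed in this message.
y_minus_x = set(
    "also can co corp could inc last mr mrs ms mz one s says will would".split()
)

# Illustrative group names; the words come from the message.
noted_in_comments = set("can could will would also says".split())
name_components = set("co corp inc mr mrs ms mz".split())

# Whatever falls in neither group is what is left to discuss.
remaining = y_minus_x - noted_in_comments - name_components

print(len(noted_in_comments), len(name_components), len(remaining))
print(sorted(remaining))
```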

's' is the second component in words like John's, boy's ... and is
not really a stopword, assuming the indexing is done intelligently.
'last' I don't think should be a stopword ('The Last Detail', 'The
cobbler's last' ...). 'one' on the other hand is an omission from x,
even if it should only be mentioned in the notes. I will fix it up. (I
can see how 'one' came to be omitted, but won't bore you with the
details.)

I will look more closely at the other stopword lists in due course.

Where did they come from? I would like to put the Finnish one in place
in the interim.

------

Actually, the English stopword list is the only one I did not make up
myself. It derives from a list that was once used in IR experiments
in Cambridge and which I have modified over the years. An early form
of it can be found on pp.18-19 of van Rijsbergen's 'Information
Retrieval', Butterworths, 1975. Interestingly, that list contains
'co', which I remember removing many years ago. I am still doubtful
about some of the entries: 'very' and 'further' for example.


Martin