[Snowball-discuss] Stop word lists

Olly Betts olly@survex.com
Tue Oct 8 13:10:02 2002


On Tue, Oct 08, 2002 at 04:34:28AM -0600, Martin Porter wrote:
> If x is the Snowball stopword list for English, and y is the English
> stopword list you sent me [...] :

> x = { a about above after again against all am an and any are as at
> be because been before being below between both but by did do does
> doing down during each few for from further had has have having he
> her here hers herself him himself his how i if in into is it its
> itself me more most my myself no nor not of off on once only or other
> our ours ourselves out over own same she so some such than that the
> their theirs them themselves then there these they this those through
> to too under until up very was we were what when where which while
> who whom why with you your yours yourself yourselves }
> 
> y = { a about after all also an and any are as at be because been but
> by can co corp could for from had has have he her his if in inc into
> is it its last more most mr mrs ms mz no not of on one only or other
> out over s says she so some such than that the their there they this
> to up was we were when which who will with would }
> 
> x-y = { above again against am before being below between both did do
> does doing down during each few further having here hers herself him
> himself how i itself me my myself nor off once our ours ourselves own
> same theirs them themselves then these those through too under until
> very what where while whom why you your yours yourself yourselves }
> 
> y-x = { also can co corp could inc last mr mrs ms mz one s says will
> would }

You may be interested in this list, which is what Google seems to use
(reverse engineered by running queries and looking at which it says
have been ignored, so it's possible there are omissions):

g = { a about an and are as at be by en for from how i in is it of on or
that the this to was what when where which who why will with }

That's 33 entries.  This was generated a few months ago from the Google
web search, and is the list which Xapian's Omega search currently uses
for search-time stopping (the only stopword lists I had to hand were
already stemmed).

I feel this is an interesting list as it's probably seen more use than
any other stopword list (millions of uses every *day*), and has
presumably been carefully tuned, at least for web searching applications
with a general audience.

Note that you can force searching for most of these by prefixing the
term with a "+" (maybe all now - I thought that "the" was always stopped,
but either my memory is faulty or they've changed that).  Also, terms
in a phrase aren't stopped.
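
The behaviour described above (stopwords dropped at search time unless
forced with "+" or used inside a quoted phrase) can be sketched roughly
as follows.  This is only an illustration: the function name
filter_query and the tokenising regex are assumptions made for the
example, and it is not how Google or Xapian actually implement it.

```python
import re

# The reverse-engineered list "g" from this message.
STOPWORDS = set(
    "a about an and are as at be by en for from how i in is it of on or "
    "that the this to was what when where which who why will with".split()
)

def filter_query(query, stopwords=STOPWORDS):
    """Split a query into (kept, ignored) terms using search-time stopping.

    Hypothetical helper: quoted phrases are kept whole, "+term" forces a
    term to be searched, and any other stopword is ignored.
    """
    kept, ignored = [], []
    # A token is either a double-quoted phrase or a run of non-space chars.
    for token in re.findall(r'"[^"]*"|\S+', query):
        if token.startswith('"'):
            kept.append(token)          # terms in a phrase aren't stopped
        elif token.startswith("+") and len(token) > 1:
            kept.append(token[1:])      # "+" forces the term
        elif token.lower() in stopwords:
            ignored.append(token)       # reported as ignored
        else:
            kept.append(token)
    return kept, ignored

kept, ignored = filter_query('how to search +the "to be or not to be" web')
print("kept:", kept)
print("ignored:", ignored)
```

Returning the ignored terms separately mirrors the way the list was
reverse engineered: the search reports which words it dropped.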

Anyway, comparing it with x and y:

g-x = { en will }
g-y = { en how i what where why }
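
For anyone wanting to recheck these comparisons, a minimal sketch using
Python sets is given here.  The three lists are copied from this
message as quoted; the helper name difference is just for the example.

```python
# Stopword lists exactly as quoted in this message.
x = set("""
a about above after again against all am an and any are as at
be because been before being below between both but by did do does
doing down during each few for from further had has have having he
her here hers herself him himself his how i if in into is it its
itself me more most my myself no nor not of off on once only or other
our ours ourselves out over own same she so some such than that the
their theirs them themselves then there these they this those through
to too under until up very was we were what when where which while
who whom why with you your yours yourself yourselves
""".split())

y = set("""
a about after all also an and any are as at be because been but
by can co corp could for from had has have he her his if in inc into
is it its last more most mr mrs ms mz no not of on one only or other
out over s says she so some such than that the their there they this
to up was we were when which who will with would
""".split())

g = set("""
a about an and are as at be by en for from how i in is it of on or
that the this to was what when where which who why will with
""".split())

def difference(a, b):
    """Return the words in a but not in b, sorted for stable output."""
    return sorted(a - b)

print("entries in g:", len(g))
print("g-x =", difference(g, x))
print("g-y =", difference(g, y))
```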

I imagine "en" is included because it's a stopword in French and Spanish
(and a very obscure word in English - a printing measurement).

I noticed recently that "image" is a stopword in the Google image search
(which probably makes sense).

> I am still doubtful about some of the entries: 'very' and 'further'
> for example.

"Further" in particular.  It's not a particularly common word, and
"further education" is hard to search for if the stopping is done at
index time.
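
To illustrate why index-time stopping hurts here, a toy positional
index is sketched below.  Everything in it (build_index, phrase_search,
the two sample documents) is made up for the example and is not how
Xapian stores postings; the point is only that a word dropped at index
time cannot be recovered for a phrase search later.

```python
def build_index(docs, stopwords=frozenset()):
    """Map each word to the set of (doc_id, position) pairs where it occurs.

    Words in `stopwords` are skipped, which models index-time stopping.
    """
    index = {}
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word in stopwords:
                continue  # dropped at index time: lost for good
            index.setdefault(word, set()).add((doc_id, pos))
    return index

def phrase_search(index, phrase):
    """Return ids of documents containing the words of `phrase` adjacently."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits = set()
    for doc_id, pos in index.get(words[0], set()):
        # Every following word must appear at the next position.
        if all((doc_id, pos + i) in index.get(w, set())
               for i, w in enumerate(words)):
            hits.add(doc_id)
    return hits

docs = {
    1: "colleges of further education in england",
    2: "education policy needs further study",
}

stopped_index = build_index(docs, stopwords={"further", "of", "in"})
full_index = build_index(docs)

print(phrase_search(stopped_index, "further education"))
print(phrase_search(full_index, "further education"))
```

With the full index, stopping can still be applied at search time to
ordinary terms while phrases remain searchable.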

Cheers,
    Olly