[Snowball-discuss] newbie does snowball remove stop words?

Olly Betts olly at survex.com
Wed Mar 25 04:40:50 GMT 2015


On Tue, Mar 24, 2015 at 10:04:04AM -0700, Andrew Davidson wrote:
> Thanks. Sounds easy enough for English. Do you know where I can find
> lists of stop words for other languages?

The Snowball website has lists for many languages - probably the easiest
way to get them all is to clone the repo for the website from GitHub:

$ git clone git@github.com:snowballstem/snowball-website.git
$ cd snowball-website
$ git ls-files '*/stop*.txt'
algorithms/danish/stop.txt
algorithms/dutch/stop.txt
algorithms/english/stop.txt
algorithms/finnish/stop.txt
algorithms/french/stop.txt
algorithms/german/stop.txt
algorithms/hungarian/stop.txt
algorithms/italian/stop.txt
algorithms/norwegian/stop.txt
algorithms/portuguese/stop.txt
algorithms/russian/stop.txt
algorithms/spanish/stop.txt
algorithms/swedish/stop.txt
otherapps/oregan/stopwords.txt

It's not clear from the filename, but the final one is for Irish.

To elaborate on why Snowball doesn't provide a mechanism to check for
stopwords: firstly, it's easy to do yourself (most languages provide a
data structure which can efficiently check whether a string is a member
of a set of strings), and secondly, you don't generally want to tie
stopwords and stemming together.
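
As a rough illustration of the "set of strings" approach, here is a
minimal Python sketch. It assumes the stop.txt files hold one stopword
at the start of each line, with anything after a "|" being a comment;
check that against the files you actually use. The parse_stopwords
helper and the sample text are made up for this example and are not
part of Snowball.

```python
def parse_stopwords(text):
    """Collect stopwords from text in the assumed stop.txt layout:
    one word at the start of a line, '|' starts a comment."""
    words = set()
    for line in text.splitlines():
        # Drop the comment part (if any), then split on whitespace.
        fields = line.split("|", 1)[0].split()
        if fields:
            words.add(fields[0])
    return words


# Made-up excerpt in the assumed format, not the real file contents.
sample = """\
 | a comment-only line
i              | first person pronoun
me
to
be
"""

STOPWORDS = parse_stopwords(sample)

# Membership in a set is an average-case constant-time check.
print("to" in STOPWORDS)
print("hamlet" in STOPWORDS)
```

For a real file you would read it first, e.g. with
open("algorithms/english/stop.txt", encoding="utf-8"), and pass the
contents to the same function.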

If you cull all stopwords at index time, you can't search for them -
e.g. the Shakespeare quote "to be or not to be" is composed entirely of
stopwords.  So you probably only want to check for stopwords at search
time, and even then not remove them unconditionally.
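
One possible way to do the "not unconditionally" part, as a sketch
only: drop stopwords from a query unless that would leave nothing to
search for. The filter_query_terms function and the tiny stopword set
below are hypothetical, and this is just one policy among several (it
also assumes the index itself kept the stopwords).

```python
def filter_query_terms(terms, stopwords):
    """Remove stopwords from a list of query terms, but keep the
    original terms if every one of them is a stopword."""
    kept = [t for t in terms if t.lower() not in stopwords]
    # If filtering would empty the query, search for the stopwords.
    return kept if kept else list(terms)


# Tiny illustrative set, not a real stopword list.
stopwords = {"to", "be", "or", "not", "the"}

print(filter_query_terms("the tempest".split(), stopwords))
print(filter_query_terms("to be or not to be".split(), stopwords))
```

With this policy "the tempest" is searched as just "tempest", while
"to be or not to be" is passed through unchanged.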

Cheers,
    Olly


