[Snowball-discuss] newbie does snowball remove stop words?
Olly Betts
olly at survex.com
Wed Mar 25 04:40:50 GMT 2015
On Tue, Mar 24, 2015 at 10:04:04AM -0700, Andrew Davidson wrote:
> Thanks. Sounds easy enough for English. Do you know where I can find
> a list of stop words for other languages?
The snowball website has lists for many languages - probably the easiest
way to get them all is to clone the repo for the website from github:
$ git clone git@github.com:snowballstem/snowball-website.git
$ cd snowball-website
$ git ls-files '*/stop*.txt'
algorithms/danish/stop.txt
algorithms/dutch/stop.txt
algorithms/english/stop.txt
algorithms/finnish/stop.txt
algorithms/french/stop.txt
algorithms/german/stop.txt
algorithms/hungarian/stop.txt
algorithms/italian/stop.txt
algorithms/norwegian/stop.txt
algorithms/portuguese/stop.txt
algorithms/russian/stop.txt
algorithms/spanish/stop.txt
algorithms/swedish/stop.txt
otherapps/oregan/stopwords.txt
It's not clear from the filename, but the final one is for Irish.
To elaborate on why Snowball doesn't provide a mechanism to check for
stopwords: firstly, it's easy to do yourself (most languages provide a
data structure which can efficiently check whether a string is a member
of a set of strings); secondly, you don't generally want to tie
stopwords and stemming together.
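As a rough illustration of that membership check, here is a minimal
Python sketch. The helper names (load_stopwords, is_stopword) are made up
for this example, and it assumes that anything after a '|' on a line of a
stop.txt file is a comment; check that against the actual files before
relying on it.

```python
def load_stopwords(lines):
    """Build a set of stopwords from an iterable of text lines.

    Assumption (not confirmed here): text after a '|' is a comment, and
    the remaining whitespace-separated tokens are stopwords.
    """
    stopwords = set()
    for line in lines:
        text = line.split("|", 1)[0]  # drop any trailing comment
        stopwords.update(word.lower() for word in text.split())
    return stopwords


def is_stopword(word, stopwords):
    """Membership test against a set (a hash lookup, so cheap per word)."""
    return word.lower() in stopwords


# Inline sample data for illustration only; in practice you would pass
# an open file object for one of the stop.txt files listed above.
sample = [
    " | a tiny hypothetical sample",
    "to             | preposition",
    "be",
    "or",
    "not",
]
STOPWORDS = load_stopwords(sample)
```

Keeping this separate from the stemmer means you decide where the check
happens (index time, search time, or not at all).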
If you cull all stopwords at index time, you can't search for them -
e.g. the Shakespeare quote "to be or not to be" is composed entirely of
stopwords. So you probably only want to check for stopwords at search
time, and even then not remove them unconditionally.
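One possible search-time policy, sketched in Python: drop stopwords from
a query unless that would leave nothing to search for. The function name
query_terms and the tiny stopword set are assumptions for illustration,
not anything Snowball provides, and the approach only helps if the
stopwords were kept in the index.

```python
# Illustrative stopword set only; a real one would come from a stop list.
STOPWORDS = {"to", "be", "or", "not", "the", "of"}


def query_terms(query, stopwords=STOPWORDS):
    """Return the terms to search for, dropping stopwords conditionally."""
    terms = query.lower().split()
    kept = [t for t in terms if t not in stopwords]
    # If every term is a stopword, keep them all so the query still works.
    return kept if kept else terms


all_stop = query_terms("to be or not to be")
mixed = query_terms("the works of Shakespeare")
```

Here all_stop keeps every word of the quote, while mixed is reduced to
the two non-stopword terms.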
Cheers,
Olly