[Snowball-discuss] Stop word lists
Olly Betts
olly@survex.com
Tue Oct 8 21:04:01 2002
On Tue, Oct 08, 2002 at 01:38:19PM -0600, Martin Porter wrote:
> The Google stopword list is very interesting. The basic list for English
> is, in my experience,
>
> { the a and of to in an }
>
> which works well on titles of technical papers.
>
> I rather doubt the 'en' is there because it is a French/Spanish word. It is
> not all that common - much less common than 'de' for example. Could it be
> connected with the language code for English, do you think?
Just checked and "de" is also a Google stopword. This might be new, or
might be because I based my tests on an English word list
(/usr/share/dict/words on Linux) so "en" was in, but "de" wasn't.
I thought I'd also checked for all 1-3 letter combinations, but it
was a while ago, and my memory is hazy.
[To try this yourself, search for a valid word and up to 9 stopword
candidates, e.g. "test de en et le la les der die das". That is 10 terms
at most, because Google truncates queries at 10 non-stopwords.]
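
A rough sketch of how the candidates could be batched into such queries, in Python. It only builds the query strings; submitting them and reading off which words were stopped is left out. The names candidate_words and build_queries, and the anchor word "test", are illustrative choices, not anything Google provides.

```python
from itertools import islice, product
from string import ascii_lowercase


def candidate_words(max_len=3):
    """Yield every lowercase ASCII letter combination of length 1..max_len."""
    for n in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=n):
            yield "".join(letters)


def build_queries(candidates, anchor="test", per_query=9):
    """Yield query strings: one valid anchor word plus up to per_query candidates.

    With per_query=9 each query has at most 10 terms, matching the
    truncation limit described above.
    """
    it = iter(candidates)
    while True:
        batch = list(islice(it, per_query))
        if not batch:
            return
        yield " ".join([anchor, *batch])


if __name__ == "__main__":
    for query in build_queries(["de", "en", "et", "le", "la", "les", "der", "die", "das"]):
        print(query)
```

Passing candidate_words() to build_queries instead of a hand-picked list would cover all 1-3 letter combinations, at the cost of a couple of thousand queries.
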
So Google also stops "de" and "la" (but not "le", oddly). There may be
others of course.
Cheers,
Olly