[Snowball-discuss] Stop word lists

Olly Betts olly@survex.com
Tue Oct 8 21:04:01 2002


On Tue, Oct 08, 2002 at 01:38:19PM -0600, Martin Porter wrote:
> The Google stopword list is very interesting. The basic list for English
> is, in my experience,
> 
>    { the a and of to in an }
> 
> which works well on titles of technical papers.
> 
> I rather doubt the 'en' is there because it is a French/Spanish word. It is
> not all that common - much less common than 'de' for example. Could it be
> connected with the language code for English do you think?

Just checked, and "de" is also a Google stopword.  This might be new, or
it might be because I based my tests on an English word list
(/usr/share/dict/words on Linux), so "en" was in it but "de" wasn't.
I thought I'd also checked for all 1-3 letter combinations, but it
was a while ago, and my memory is hazy.
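
For anyone wanting to repeat the 1-3 letter check, here is a rough Python
sketch of how the candidates could be enumerated. It is not the script I
used back then; the function name short_candidates is made up for
illustration, and it assumes plain lowercase ASCII letters only.

```python
from itertools import product
from string import ascii_lowercase


def short_candidates(max_len=3):
    """Yield every lowercase ASCII string of length 1..max_len.

    Hypothetical helper: the name and the ASCII-only alphabet are
    assumptions made for this sketch.
    """
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters)


candidates = list(short_candidates())
# 26 + 26**2 + 26**3 candidates in total
print(len(candidates))
```

Unlike a dictionary word list, this covers strings such as "de" whether or
not they happen to be English words.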

[To try this yourself, search for a valid word plus up to 9 stopword
candidates, e.g. "test de en et le la les der die das". That is 10 words
at most, because Google truncates queries at 10 non-stopwords.]
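
The batching for that test can be sketched in Python as below. This only
builds the query strings; it does not contact Google, and reading the
results is still a manual job. The names build_probe_queries and
MAX_QUERY_TERMS are invented for this example, and the limit of 10 is just
the figure quoted above.

```python
from typing import Iterable, List

# Query length limit as described in this message (an assumption about
# Google's behaviour at the time, not a documented constant).
MAX_QUERY_TERMS = 10


def build_probe_queries(anchor: str,
                        candidates: Iterable[str],
                        max_terms: int = MAX_QUERY_TERMS) -> List[str]:
    """Group candidates into queries of one anchor word plus up to
    max_terms - 1 candidates each."""
    per_query = max_terms - 1
    items = list(candidates)
    return [
        " ".join([anchor] + items[i:i + per_query])
        for i in range(0, len(items), per_query)
    ]


queries = build_probe_queries(
    "test", ["de", "en", "et", "le", "la", "les", "der", "die", "das"]
)
print(queries)  # ['test de en et le la les der die das']
```

The anchor word should be one known not to be a stopword, so that every
query has at least one term that is certain to be searched for.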

So Google also stops "de" and "la" (but, oddly, not "le").  There may be
others, of course.

Cheers,
    Olly