[Snowball-discuss] Several errors in the swedish vocabulary

Olly Betts olly at survex.com
Mon Aug 7 01:27:45 BST 2023


On Sun, Aug 06, 2023 at 08:03:39AM +0100, Martin Porter wrote:
> The vocabulary does not help define the algorithm but illustrates its
> use, see the last paragraph in section 1 of
> http://snowball.tartarus.org/texts/introduction.html

Note for readers: that's the old (and long-frozen) Snowball website.
The current website URL is:

https://snowballstem.org/texts/introduction.html

> Nonsense words and mis-spellings are helpful in this, and should not
> be removed: they reflect a common enough feature of real text.

I'd tend to agree.

To expand on this point a little, the main purposes of the vocabulary
lists and the corresponding output.txt files listing the expected
stemmed equivalents are:

* To serve as a test suite for the Snowball compiler and stemmer
  implementations - `make check` verifies that the code the Snowball
  compiler outputs in each of the supported programming languages (C,
  Python, etc.) stems the words in the vocabulary to give the expected
  answers.

* To provide a basis for evaluating any proposed change to the
  stemming algorithm.

For both purposes, it's useful for the list to represent the inputs
which the stemmer will encounter in its intended field of use (indexing
and searching text in a particular language).  Including common typos,
proper nouns, and foreign words is therefore not a bug: it's useful to
consider how the stemmer handles such cases.
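
As a rough illustration of the first purpose, the comparison that a
check like this performs can be sketched in Python as below.  This is
not the actual `make check` machinery: the function names are
hypothetical, `stem` is whatever stemming function you want to test,
and the sketch assumes voc.txt and output.txt are UTF-8 files with one
word per line, where line N of output.txt is the expected stem of line
N of voc.txt.

```python
from typing import Callable, Iterable, List, Tuple


def read_words(path: str) -> List[str]:
    """Read one word per line from a UTF-8 file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def find_mismatches(
    words: Iterable[str],
    expected: Iterable[str],
    stem: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (word, expected_stem, actual_stem) for every disagreement.

    `stem` is a stand-in for the stemmer under test; it is a
    hypothetical parameter, not part of Snowball itself.
    """
    words = list(words)
    expected = list(expected)
    if len(words) != len(expected):
        raise ValueError("vocabulary and expected output differ in length")
    mismatches = []
    for word, want in zip(words, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


def toy_stem(word: str) -> str:
    """A deliberately naive stand-in stemmer: strip one trailing 's'."""
    return word[:-1] if word.endswith("s") else word
```

Typos and nonsense words fit into this scheme like any other entry:
the expected output just records whatever the stemmer is meant to
produce for them, so a vocabulary line such as "teh" is checked in the
same way as a correctly spelled word.
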

If a voc.txt contains a disproportionate quantity of such words, or many
words that aren't justifiable by the above, then that is probably worth
reporting.  For example, we'd certainly want to hear if swedish/voc.txt
was actually a list of Norwegian words, or (as has happened) if
romanian/voc.txt used incorrect character encodings.

Cheers,
    Olly


