[Snowball-discuss] Several errors in the swedish vocabulary
Olly Betts
olly at survex.com
Mon Aug 7 01:27:45 BST 2023
On Sun, Aug 06, 2023 at 08:03:39AM +0100, Martin Porter wrote:
> The vocabulary does not help define the algorithm but illustrates its
> use, see the last paragraph in section 1 of
> http://snowball.tartarus.org/texts/introduction.html
Note for readers: that's the old (and long-frozen) Snowball website.
The current website URL is:
https://snowballstem.org/texts/introduction.html
> Nonsense words and mis-spellings are helpful in this, and should not
> be removed: they reflect a common enough feature of real text.
I'd tend to agree.
To expand on this point a little, the main purposes of the vocabulary
lists and the corresponding output.txt files listing the expected
stemmed equivalents are:
* To serve as a test suite for the Snowball compiler and stemmer
implementations - `make check` verifies that the code the Snowball
compiler outputs in each of the supported programming languages (C,
Python, etc.) stems the words in the vocabulary to give the expected
answers.
* To provide a basis for evaluating any proposed change to the
stemming algorithm.
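
As a rough illustration of the comparison behind both purposes, here is a
minimal Python sketch. It is not how `make check` is implemented; the
names `find_mismatches` and `toy_stem` are hypothetical, and the toy
stemmer merely stands in for a real one. It assumes the word list and the
expected stems are aligned one-to-one, as voc.txt and output.txt are.

```python
from typing import Callable, Iterable, List, Tuple


def find_mismatches(
    stem: Callable[[str], str],
    words: Iterable[str],
    expected: Iterable[str],
) -> List[Tuple[str, str, str]]:
    """Return (word, expected_stem, actual_stem) for each disagreement."""
    mismatches = []
    for word, want in zip(words, expected):
        got = stem(word)
        if got != want:
            mismatches.append((word, want, got))
    return mismatches


def toy_stem(word: str) -> str:
    # Hypothetical stand-in for a real stemmer: strips one trailing "s".
    return word[:-1] if word.endswith("s") else word


# In practice these would be the lines of voc.txt and output.txt.
words = ["cats", "dog", "glass"]
expected = ["cat", "dog", "glass"]

mismatches = find_mismatches(toy_stem, words, expected)
for word, want, got in mismatches:
    print(f"{word}: expected {want!r}, got {got!r}")
```

Running the same comparison before and after a proposed change to an
algorithm shows exactly which vocabulary entries the change affects.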
For both cases, it's useful for the list to represent inputs which the
stemmer will encounter in the intended field of use (indexing and
searching text in a particular language). Including common typos,
proper nouns, and foreign words is therefore not a bug: it's useful to
consider how the stemmer handles such cases.
If a voc.txt contains a disproportionate quantity of such words, or many
words that aren't justifiable by the above, then that is probably worth
reporting. For example, we'd certainly want to hear if swedish/voc.txt
was actually a list of Norwegian words, or (as has happened) if
romanian/voc.txt used incorrect character encodings.
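
For anyone wanting to sanity-check a vocabulary file before reporting, a
byte-level check along these lines may help. This is only a sketch: the
function name `invalid_utf8_lines` is hypothetical, and it assumes the
file is meant to be UTF-8. It can only catch bytes that fail to decode,
not text that decodes cleanly but uses the wrong characters.

```python
from typing import List


def invalid_utf8_lines(data: bytes) -> List[int]:
    """Return 1-based numbers of lines that are not valid UTF-8."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


# Line 2 is a lone 0xBA byte, which is not valid UTF-8 on its own.
sample = "ș\n".encode("utf-8") + b"\xba\n" + b"abc\n"
print(invalid_utf8_lines(sample))
```

To check a real file, read it in binary mode (for example
`open(path, "rb").read()`) and pass the bytes to the function.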
Cheers,
Olly