[Snowball-discuss] Suggestion: 'aci'(<-'at')

Chris Hennick christopherhe at trentu.ca
Wed Apr 2 08:33:58 BST 2014


On 1 April 2014 09:37, Martin Porter <martin.f.porter at gmail.com> wrote:

> In short, I think suffix inclusion must be judged against percentage
> improvement and suffix rarity.
>

We have to keep in mind, though, that lots of subjects include jargon terms
and neologisms that won't be found in word lists. I've had to cluster a
collection of documents in which over a dozen contained the words
"transhuman" and "Singularitarian"; YAWL contains neither word.

Maybe what's needed are compile-time options for building several versions
of the stemmer. Besides working with texts on different subjects and in
different sociolinguistic registers, users also differ in whether they want
speed emphasized over accuracy, whether sense-disambiguating suffixes should
be preserved (e.g. giving "conservative" a different stem from
"conservation"), and whether "-ist" and "-arian" should be stemmed through.
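
As a rough sketch of what such build options might look like, here is a toy
suffix stripper in Python. It is not the Snowball English stemmer or any part
of it: the option names (keep_sense_suffixes, strip_agent_suffixes), the rule
table, and the minimum stem length of 3 are all assumptions made up for
illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StemmerOptions:
    # Hypothetical switches; Snowball itself defines no such options.
    keep_sense_suffixes: bool = False   # keep "-ative" distinct from "-ation"
    strip_agent_suffixes: bool = True   # stem through "-ist" and "-arian"


def build_stemmer(opts):
    """Return a toy stem(word) function configured by opts."""
    rules = []
    if opts.strip_agent_suffixes:
        rules += [("arian", ""), ("ist", "")]
    if opts.keep_sense_suffixes:
        # Shorten the suffixes but leave the stems distinguishable.
        rules += [("ation", "at"), ("ative", "ativ")]
    else:
        # Conflate both forms onto one stem.
        rules += [("ation", ""), ("ative", "")]

    def stem(word):
        w = word.lower()
        for suffix, replacement in rules:
            # Assumed guard: leave at least 3 characters of stem.
            if w.endswith(suffix) and len(w) - len(suffix) >= 3:
                return w[: -len(suffix)] + replacement
        return w

    return stem


conflating = build_stemmer(StemmerOptions())
preserving = build_stemmer(StemmerOptions(keep_sense_suffixes=True))
cautious = build_stemmer(StemmerOptions(strip_agent_suffixes=False))
```

With the default options, "conservative" and "conservation" share a stem;
with keep_sense_suffixes=True they do not. With strip_agent_suffixes=False,
a word like "Singularitarian" is only lowercased. A real build would bake
such choices in when the stemmer is generated rather than checking them at
run time.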