[Snowball-discuss] Suggestion: 'aci'(<-'at')
James Aylett
james-xapian at tartarus.org
Wed Apr 2 08:59:47 BST 2014
On 2 Apr 2014, at 08:33, Chris Hennick <christopherhe at trentu.ca> wrote:
> We have to keep in mind, though, that lots of subjects include jargon terms and neologisms that won't be found in word lists. I've had to cluster a collection of documents in which over a dozen contained the words "transhuman" and "Singularitarian"; YAWL contains neither word.
>
> Maybe what's needed are compile-time options to make several versions of the stemmer. In addition to having texts on different subjects and written in different sociolinguistic registers, users must also differ in whether they want speed emphasized over accuracy, sense-disambiguating suffixes preserved (e.g. giving "conservative" a different stem from "conservation") or not, and "-ist" and "-arian" stemmed through or not.
What you're talking about sounds like different stemmers to me. Given how slowly the porter2-based English stemmer changes (the last change was 7 years ago, I believe!), maintaining a separate .sbl file should not prove onerous.
Possibly there's some work in the snowball distribution worth doing to make it easier to "roll in" external snowball languages; Martin or Richard would be better placed to discuss that.
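As a rough illustration of what "different stemmers" could mean for the "-ist" and "-arian" choice raised above, here is a toy Python sketch. It is not Snowball code and does not implement the porter2 algorithm; the names StemmerVariant and toy_stem, the two flags, and the minimum stem length of 3 are all assumptions invented for the example.

```python
# Toy illustration only: NOT the Snowball English (porter2) algorithm.
from dataclasses import dataclass


@dataclass(frozen=True)
class StemmerVariant:
    """Hypothetical switches mirroring the choices discussed in the thread."""
    strip_ist: bool = False
    strip_arian: bool = False


def toy_stem(word: str, variant: StemmerVariant) -> str:
    """Lowercase the word and strip at most one enabled suffix.

    A suffix is removed only if at least 3 letters remain (an arbitrary
    threshold chosen for this sketch).
    """
    word = word.lower()
    suffixes = []
    if variant.strip_arian:
        suffixes.append("arian")
    if variant.strip_ist:
        suffixes.append("ist")
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Two "stemmers" built from the same code, differing only in their options.
conservative = StemmerVariant()
aggressive = StemmerVariant(strip_ist=True, strip_arian=True)
```

With these definitions, toy_stem("Singularitarian", conservative) leaves the word intact apart from lowercasing, while the aggressive variant strips the suffix. In Snowball itself the equivalent would more likely be two separate .sbl files than one file with switches.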
J
--
James Aylett, occasional trouble-maker
xapian.org