[Xapian-devel] A Hebrew stemmer based on libhspell

Asaf Bartov asaf.bartov at gmail.com
Sun Apr 11 19:54:04 BST 2010


Hello.

I'm interested in creating a Hebrew stemmer to use with Xapian.  Hebrew is a
complicated language to stem: it uses the Semitic "root" system rather than
prefixes and suffixes, and its morphology (accidence) has many
irregularities.

Fortunately, two bright fellows from the Technion in Israel have already
created a Hebrew morphological analyzer as part of their Hebrew spellchecker
project (hspell), which is the de facto standard free Hebrew spellchecker
(used in GMail etc.).  This analyzer is heavily lexicon-based, and is
therefore difficult to express as a Snowball program.

hspell offers a convenient API: give it a word, get back a list of possible
stems.  (Yes, Hebrew is very ambiguous, too, so a single form may have two
or even more possible stems; I mean completely different words, not
variations.)  I therefore want to leverage libhspell in Xapian without going
through Snowball at all.
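
To make the "one word in, several stems out" shape concrete, here is a
minimal sketch.  It does not call libhspell; ToyHebrewAnalyzer,
candidate_stems() and the two-entry lexicon are hypothetical stand-ins
invented for illustration, and the real library's function names and data
will differ.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for a lexicon-backed analyzer such as libhspell.
// The class, the method name, and the lexicon entries are illustrative only.
class ToyHebrewAnalyzer {
  public:
    ToyHebrewAnalyzer() {
        // One surface form, two unrelated base words:
        // "the train" (noun) and "you assembled" (verb).
        lexicon_["הרכבת"] = {"רכבת", "הרכיב"};
        // An unambiguous form for contrast: "books".
        lexicon_["ספרים"] = {"ספר"};
    }

    // Return every candidate stem for `word`; empty if the word is unknown.
    std::vector<std::string> candidate_stems(const std::string& word) const {
        auto it = lexicon_.find(word);
        if (it == lexicon_.end()) return {};
        return it->second;
    }

  private:
    std::map<std::string, std::vector<std::string>> lexicon_;
};

// Usage: an ambiguous form yields more than one candidate stem.
inline void demo_ambiguity() {
    ToyHebrewAnalyzer analyzer;
    assert(analyzer.candidate_stems("הרכבת").size() == 2);
}
```

The point of the sketch is the return type: a Snowball-style stemmer maps one
string to one string, while a lexicon lookup naturally returns a list.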

I took a quick look at xapian-core, and I see that stem.cc seems to have
some accommodation for an abstraction of a stemming algorithm.  On the
other hand, get_available_languages() returns LANGSTRING, which is
generated in the allsnowballheaders.h file and so assumes Snowball is used
for all stemmers.

So I'm a little confused about this.  Can anyone shed light on the status of
generic stemming -- is this half-written support?

It seems to me I could instantiate an ExternalHebrewStemmer of my own
making, calling libhspell instead of Snowball.  What do you think?
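
A rough sketch of what I have in mind follows.  Everything in it is an
assumption: StemmerInterface is a hypothetical base class defined here for
illustration (it is not copied from xapian-core), and the analyzer is passed
in as a callback so that the example compiles without libhspell.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stemmer interface, defined locally for illustration only.
class StemmerInterface {
  public:
    virtual ~StemmerInterface() {}
    virtual std::string operator()(const std::string& word) const = 0;
    virtual std::string get_description() const = 0;
};

// Callback returning all candidate stems for a word.  A real build would
// wrap libhspell here; in this sketch any callable can be plugged in.
using Analyzer = std::function<std::vector<std::string>(const std::string&)>;

class ExternalHebrewStemmer : public StemmerInterface {
  public:
    explicit ExternalHebrewStemmer(Analyzer analyzer)
        : analyzer_(std::move(analyzer)) {
        assert(analyzer_ && "an analyzer callback is required");
    }

    // A one-word-in, one-term-out stemmer has to pick a single candidate.
    // Policy assumed here: take the first candidate, and fall back to the
    // unmodified word when the analyzer does not know it.
    std::string operator()(const std::string& word) const override {
        const std::vector<std::string> stems = analyzer_(word);
        return stems.empty() ? word : stems.front();
    }

    std::string get_description() const override {
        return "ExternalHebrewStemmer(sketch)";
    }

  private:
    Analyzer analyzer_;
};
```

Taking the first candidate is only one possible policy.  Since a single form
may have completely different stems, indexing every candidate as a separate
term might give better recall, at the cost of some precision.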

Thanks,

   Asaf Bartov, Wikimedia Israel
-- 
Asaf Bartov <asaf.bartov at gmail.com>