[Snowball-discuss] Mismatch between vocab.txt and output.txt

James Aylett james@tartarus.org
Mon Oct 14 15:28:01 2002


On Mon, Oct 14, 2002 at 06:08:14PM +0400, Oleg Bartunov wrote:

> As I understand it, a stemmer by definition understands any word!
> So, I don't see any chance to stem bilingual documents.

Applying a different stemmer to each part of a document, chosen by the
language of that part, is pretty easy. Many document formats have ways
of indicating the language in use, such as HTML's lang attribute, so
you can mix and match languages within a single document this way.
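
As a rough illustration of the idea, here is a minimal Python sketch
that walks an HTML fragment, tracks the lang attribute in effect, and
picks a stemmer per run of text. The two "stemmers" are toy stand-ins
invented for this example (they just strip a suffix); a real setup
would call the Snowball-generated stemmer for each language. The names
LangStemmer, STEMMERS and VOID_TAGS are likewise assumptions of the
sketch, not part of any library.

```python
from html.parser import HTMLParser

# Toy stand-ins for real stemmers (hypothetical, for illustration only).
def toy_english_stem(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def toy_french_stem(word):
    return word[:-2] if word.endswith("es") and len(word) > 4 else word

STEMMERS = {"en": toy_english_stem, "fr": toy_french_stem}

# Elements that never have a closing tag, so they must not be pushed
# onto the language stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class LangStemmer(HTMLParser):
    """Collect (language, stem) pairs, honouring nested lang attributes."""

    def __init__(self, default_lang="en"):
        super().__init__()
        self.lang_stack = [default_lang]
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        lang = dict(attrs).get("lang")
        if lang:
            # Keep only the primary subtag, so "fr-CA" selects "fr".
            lang = lang.split("-")[0].lower()
        else:
            # No lang attribute: inherit from the enclosing element.
            lang = self.lang_stack[-1]
        self.lang_stack.append(lang)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        lang = self.lang_stack[-1]
        # Unknown languages fall back to leaving the word unstemmed.
        stem = STEMMERS.get(lang, lambda w: w)
        for word in data.lower().split():
            self.terms.append((lang, stem(word)))

parser = LangStemmer()
parser.feed('<p>two cats <span lang="fr">deux portes</span> sleep</p>')
parser.close()
print(parser.terms)
```

The stack is there so that text after a nested element goes back to
the language of the enclosing element.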

(When using within an IR system for searching over the documents it
becomes slightly more tricky. I'm sure someone has done some proper
thinking on the matter; the best I've come up with is to create index
terms from both the stemmed output and a marker indicating the source
language. When searching, you might have to generate search terms for
ambiguous searches by stemming for all languages that have been used in
creating the database; hopefully most searches would be more
obvious. You could use language detection methods to make this
automatic in some cases ...)
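
A minimal Python sketch of that indexing scheme, under some loud
assumptions: the "lang:stem" term format, the toy suffix-stripping
stemmers, and the function names are all made up for the example and
are not how any particular IR system does it. Index terms combine the
stem with a language marker; a query with no known language is stemmed
once per language present in the database.

```python
from collections import defaultdict

# Toy stand-ins for real stemmers (hypothetical, for illustration only).
def toy_english_stem(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def toy_french_stem(word):
    return word[:-2] if word.endswith("es") and len(word) > 4 else word

STEMMERS = {"en": toy_english_stem, "fr": toy_french_stem}

index = defaultdict(set)   # index term -> set of document ids
languages_used = set()     # languages seen while building the index

def make_term(word, lang):
    # Index term = language marker plus stemmed word (assumed format).
    return f"{lang}:{STEMMERS[lang](word.lower())}"

def add_document(doc_id, words, lang):
    languages_used.add(lang)
    for word in words:
        index[make_term(word, lang)].add(doc_id)

def query_terms(word, lang=None):
    # With a known language, stem once; otherwise stem for every
    # language that was used in creating the database.
    langs = [lang] if lang else sorted(languages_used)
    return sorted({make_term(word, l) for l in langs})

def search(word, lang=None):
    hits = set()
    for term in query_terms(word, lang):
        hits |= index.get(term, set())
    return hits

add_document(1, ["two", "cats"], "en")
add_document(2, ["deux", "portes"], "fr")
print(query_terms("portes"))
print(search("portes"))
```

Because the marker is part of the term, an English stem and a French
stem that happen to be spelled the same do not collide in the index.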

J 

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org