[Xapian-discuss] Term extraction with Xapian

Charlie Hull charlie at juggler.net
Tue Feb 14 15:44:45 GMT 2006


Olly Betts wrote:
> 
>> Let's say I have a raw text of 300 words. I want to extract terms
>> (nouns/phrases) like "ipod nano", "sony z1", "tom cruise", etc.
>>
>> I wonder how I could do that with Xapian (which provides really good
>> performance!) using its termlist and maybe some fuzzy logic operators?
> 
> If you can pull out the noun phrases and add them as terms at index
> time, you can use relevance feedback to do the filtering (via the
> Xapian::Expand class).  There are GPL part-of-speech taggers, but
> I've not tried any of them.  You might be able to get by with some
> heuristics (e.g. capital letters, words containing numbers) to pick
> suitable word pairs.
> 
> Cheers,
>     Olly

Richard wrote a library called AyeAye that does this kind of thing.
It uses various heuristics to extract terms from plain text or HTML.
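
To give a feel for the sort of heuristic Olly mentions (capital letters,
words containing numbers), here is a minimal Python sketch. It is not
how AyeAye works and it uses no Xapian API; the names looks_like_name
and candidate_pairs are made up for illustration, and the rule "both
adjacent words contain an uppercase letter or a digit" is just an
assumption about what a reasonable first cut might look like.

```python
import re

# Hypothetical helpers: only the heuristics from the quoted message
# (capital letters, words containing numbers) are implemented here.
WORD_RE = re.compile(r"[A-Za-z0-9]+")

def looks_like_name(word):
    """True if the word contains an uppercase letter or a digit."""
    return any(c.isupper() or c.isdigit() for c in word)

def candidate_pairs(text):
    """Return lowercased adjacent word pairs where both words look like names."""
    pairs = []
    # Split on sentence-ending punctuation so pairs do not span sentences.
    for sentence in re.split(r"[.!?]+", text):
        words = WORD_RE.findall(sentence)
        for first, second in zip(words, words[1:]):
            if looks_like_name(first) and looks_like_name(second):
                pair = (first + " " + second).lower()
                if pair not in pairs:  # keep first-seen order, drop repeats
                    pairs.append(pair)
    return pairs

text = "Tom Cruise bought an iPod Nano. Then he sold his Sony Z1 phone."
print(candidate_pairs(text))
# prints ['tom cruise', 'ipod nano', 'sony z1']
```

The resulting pairs could then be added as extra terms at index time,
as Olly suggests. Note the obvious weaknesses: a capitalised word at
the start of a sentence followed by a name gives a false positive, and
all-lowercase input yields nothing, so real text would need more rules
or a proper tagger.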

Charlie
www.lemurconsulting.com


