[Xapian-discuss] Term extraction with Xapian
Charlie Hull
charlie at juggler.net
Tue Feb 14 15:44:45 GMT 2006
Olly Betts wrote:
>
>> Let's say I have a raw text of 300 words. I want to extract terms
>> (nouns/phrases) like "ipod nano", "sony z1", "tom cruise", etc.
>>
>> I wonder how I could do that with Xapian (which provides really good
>> performance!) using its termlist and maybe some fuzzy logic operators?
>
> If you can pull out the noun phrases and add them as terms at index
> time, you can use relevance feedback to do the filtering (via the
> Xapian::Expand class). There are GPL part-of-speech taggers, but
> I've not tried any of them. You might be able to get by with some
> heuristics (e.g. capital letters, words containing numbers) to pick
> suitable word pairs.
>
> Cheers,
> Olly
We've got a library called AyeAye, written by Richard, that does this
kind of thing. It uses various heuristics to extract terms from plain
text or HTML.
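
For illustration only, the capital-letters-and-digits heuristic Olly
mentions could be sketched roughly like this in Python. This is not how
AyeAye works; the function names and the tiny stopword list are invented
for the example, and a real extractor would need much more care.

```python
import re

# Hypothetical, deliberately tiny stopword list: capitalised function
# words (e.g. at the start of a sentence) should not count as names.
STOPWORDS = {"i", "a", "an", "the", "we", "he", "she", "it", "they",
             "this", "that"}


def looks_like_name(word):
    """Heuristic: the word has an uppercase letter or a digit in it."""
    if word.lower() in STOPWORDS:
        return False
    return any(ch.isupper() or ch.isdigit() for ch in word)


def candidate_pairs(text):
    """Return lowercased adjacent word pairs where both words pass."""
    pairs = []
    # Crude sentence split so pairs never span a sentence boundary.
    for sentence in re.split(r"[.!?]+\s+", text):
        words = re.findall(r"[A-Za-z0-9]+", sentence)
        for first, second in zip(words, words[1:]):
            if looks_like_name(first) and looks_like_name(second):
                pairs.append(f"{first} {second}".lower())
    return pairs


if __name__ == "__main__":
    sample = ("I bought an iPod Nano yesterday. "
              "Tom Cruise was seen with a Sony Z1.")
    print(candidate_pairs(sample))
```

Pairs found this way could then be added to each document as extra
terms at index time (for example with Xapian's Document::add_term), so
that relevance feedback has something to filter.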
Charlie
www.lemurconsulting.com