[Xapian-discuss] QueryParser stemming
Tim Brody
tdb01r at ecs.soton.ac.uk
Mon Jun 13 16:41:49 BST 2005
Olly Betts wrote:
> On Thu, Jun 09, 2005 at 11:57:54AM +0100, Tim Brody wrote:
>
>>I'm considering expanding Xapian to cover all of my search fields:
>>authors => A
>>title => T [stemmed]
>>description => D [stemmed]
>>date => Y [range?]
>>(fulltext => F) [stemmed]
>>
>>I would like to allow users to specify a query e.g.
>>Brody impact analysis 2004
>>
>>If the user isn't explicit with prefixing I need to be able to modify the
>>query terms (e.g. 'Brody' is an author name) to apply stemming and
>>prefixing as appropriate.
>
> I don't follow how you know 'Brody' is meant to be an author name.
> Assuming all capitalised words are author names seems likely to
> frustrate anyone who doesn't read and memorise the help.
The first query anybody gives a citation index is their own name - I
want to get that search right (e.g. if someone enters 'Hawking' I want
to first list papers by Stephen Hawking, then all papers that contain
'hawk' as a term). It's not difficult to maintain an author name
vocabulary to pre-fetch from.
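
A rough sketch of what I mean, in plain Perl with no Xapian calls. The vocabulary contents, the toy stemmer, the helper name expand_word and the "four digits means a year" rule are all made up for illustration; a real version would use the library's stemmer and a vocabulary fetched from the index:

```perl
use strict;
use warnings;

# Hypothetical author vocabulary; in practice this would be pre-fetched.
my %AUTHORS = map { $_ => 1 } qw(brody hawking);

# Toy stemmer standing in for a real one (assumption, not Xapian's).
sub toy_stem {
    my ($word) = @_;
    $word =~ s/(?:ing|s)\z//;
    return $word;
}

# Expand one unprefixed query word into candidate index terms.
sub expand_word {
    my ($word, $stem) = @_;
    my $lc = lc $word;
    my @terms;
    push @terms, "A$lc" if $AUTHORS{$lc};        # author terms stay unstemmed
    if ($lc =~ /\A\d{4}\z/) {
        push @terms, "Y$lc";                     # assumed: four digits is a year
    }
    else {
        my $stemmed = $stem->($lc);
        push @terms, "T$stemmed", "D$stemmed";   # stemmed title/description
    }
    return @terms;
}

for my $word (qw(Brody impact analysis 2004)) {
    print "$word => @{[ expand_word($word, \&toy_stem) ]}\n";
}
```

The author term comes first in each list so that it could be weighted above the stemmed terms when the query is built.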
>>I don't think I can achieve this with the current Perl bindings. To do
>>title OR description OR fulltext I need to iterate over the terms and
>>add the appropriate prefix for each field. Similarly I will want to stem
>>title/description terms, but leave author terms alone.
>>
>>So, is this feasible? Is there a better approach?
>
> For searching over all fields, you can do the work at index time instead
> of search time (with the exception of the non-stemming), which is likely
> to give a faster search. I'd probably recommend that approach.
>
> So for the author, title, and description fields, you generate both the
> prefixed terms, and non-prefixed ones. Except you need to stem the
> non-prefixed author terms then. I don't see an easy way to avoid that.
>
> As for not wanting the same stemming strategy for all fields,
> QueryParser::add_prefix() should probably take a stem_strategy argument
> which overrides the main setting.
I think this is the only way to achieve what I want (from Perl anyway).
An alternative would be to call Stem with the current prefix, which would
provide complete flexibility.
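
For the index-time approach, something like the following is what I understand the suggestion to be: each field gets its prefixed terms (stemmed or not, per field) plus unprefixed terms that are always stemmed. This is only a sketch; the %FIELDS table, toy_stem and index_terms are hypothetical names, and the stemmer is a stand-in rather than the real one:

```perl
use strict;
use warnings;

# Per-field settings: term prefix and whether the prefixed form is stemmed.
my %FIELDS = (
    authors     => { prefix => 'A', stem => 0 },
    title       => { prefix => 'T', stem => 1 },
    description => { prefix => 'D', stem => 1 },
);

# Toy stemmer standing in for a real one (assumption).
sub toy_stem {
    my ($word) = @_;
    $word =~ s/(?:ing|s)\z//;
    return $word;
}

# Return the sorted, de-duplicated terms to index for one document.
sub index_terms {
    my ($doc, $stem) = @_;
    my %seen;
    for my $field (sort keys %FIELDS) {
        next unless defined $doc->{$field};
        my $cfg = $FIELDS{$field};
        for my $word (map { lc($_) } $doc->{$field} =~ /(\w+)/g) {
            my $stemmed = $stem->($word);
            # Prefixed form follows the field's own stemming setting.
            $seen{ $cfg->{prefix} . ($cfg->{stem} ? $stemmed : $word) } = 1;
            # Unprefixed form is always stemmed, author terms included.
            $seen{$stemmed} = 1;
        }
    }
    return sort keys %seen;
}

my @terms = index_terms({ authors => 'Hawking', title => 'Impacts' }, \&toy_stem);
print "@terms\n";
```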
>>Shall I start adding the Internals to Perl's bindings?
>
> The interfaces to the Internals classes are subject to arbitrary change
> without notice. It doesn't make sense to try to wrap them.
>
> Anyway, the binding layer is the wrong place to add this in my view. We
> don't really want to add generic functionality there - that belongs in
> the core library where it's accessible to all users. Wrapping things in
> a way more natural to the language is fine - for example lazy lists
> instead of iterators. That's inherently language specific.
It would be useful to be able to manipulate a query after it's been
built by the QP. A simple thing to expose might be the serialisation -
stored queries and all that!
>>(And what happened to my patches? :-)
>
>
> I'm working through them. There are some changes which are good but I
> want to generalise. So far I've made == and != work the same as 'eq'
> and 'ne' on all iterators, not just TermIterator - that's all applied
> and committed. Also, being able to use Perl lists which wrap iterators
> should be available everywhere really. I've stalled a little on that
> because we're really going to want lazy evaluation for some cases (e.g.
> Database::allterms) and I need to read up on how that's done.
You can overload '<>' in a new class that contains the begin and end
iterators. This would allow:
while(defined(my $term = <$it>)) {
}
Overloading '@{}' in the same class would provide list access
(the complete termlist would go into memory):
for(@$it) {
}
Mixing the two access methods would result in missing terms.
TIEing an array can't be implemented efficiently with iterators, and
would need a class per iterator-use vs. one class per iterator above.
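
Here is a minimal sketch of the overloading idea. MockIterator is a made-up stand-in for a real iterator (its term/step/equal methods are not the Search::Xapian interface), and TermRange is a hypothetical name for the wrapper class:

```perl
use strict;
use warnings;

# Stand-in for a library iterator: a position into a shared term list.
package MockIterator;
sub new   { my ($class, $terms, $pos) = @_; return bless { terms => $terms, pos => $pos }, $class }
sub term  { my ($self) = @_; return $self->{terms}[ $self->{pos} ] }
sub step  { my ($self) = @_; $self->{pos}++; return $self }
sub equal { my ($self, $other) = @_; return $self->{pos} == $other->{pos} }

# Wrapper holding a begin/end pair. It is hash-based so that the '@{}'
# overload does not interfere with access to its own fields.
package TermRange;
use overload
    '<>'     => \&next_term,   # while (defined(my $t = <$range>)) { ... }
    '@{}'    => \&all_terms,   # for (@$range) { ... }
    fallback => 1;

sub new {
    my ($class, $begin, $end) = @_;
    return bless { it => $begin, end => $end }, $class;
}

# Return the current term and advance, or undef once the end is reached.
sub next_term {
    my ($self) = @_;
    return if $self->{it}->equal($self->{end});
    my $term = $self->{it}->term;
    $self->{it}->step;
    return $term;
}

# Drain whatever remains into an array held in memory.
sub all_terms {
    my ($self) = @_;
    my @rest;
    while (defined(my $term = $self->next_term)) { push @rest, $term }
    return \@rest;
}

package main;

my @terms = qw(analysis brody impact);
my $range = TermRange->new(MockIterator->new(\@terms, 0),
                           MockIterator->new(\@terms, scalar @terms));
my $first = <$range>;   # consumes one term
my @rest  = @$range;    # only the terms not yet consumed
```

The last two lines show the mixing problem: once '<>' has consumed a term, the list view no longer includes it.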
All the best,
Tim.