From ostmann at websuche.de Fri Feb 3 10:10:15 2012
From: ostmann at websuche.de (Websuche :: Felix Antonius Wilhelm Ostmann)
Date: Fri, 03 Feb 2012 11:10:15 +0100
Subject: [Xapian-discuss] Using synonyms and order of results
In-Reply-To: <4F2662A0.2080000@websuche.de>
References: <4F1949B5.5030403@fayettedigital.com> <20120121012136.GM17351@survex.com> <4F1D60FD.8060102@fayettedigital.com> <4F2662A0.2080000@websuche.de>
Message-ID: <4F2BB287.6050406@websuche.de>

I found a simple solution using OP_AND_MAYBE and OP_SCALE_WEIGHT!

The new query:

[QUERY: Xapian::Query((0.5 * ((Zstempel:(pos=1) SYNONYM Zamtszeich:(pos=1)
SYNONYM Zgrubenholz:(pos=1) SYNONYM Zkennzeich:(pos=1) SYNONYM
Zpoststempel:(pos=1) SYNONYM Zpragestempel:(pos=1) SYNONYM Zpunz:(pos=1)
SYNONYM Zsiegel:(pos=1)) FILTER QMde) AND_MAYBE Zstempel:(pos=1)))]

It also works with multiple terms. Again, Xapian is simple and fast!

On 30.01.2012 10:28, Websuche :: Felix Antonius Wilhelm Ostmann wrote:
> We are using FLAG_AUTO_SYNONYMS and it works like a charm (+stemmer),
> but we currently have a problem with the order of the results. We think
> the best result is one that matches without a synonym.
>
> We search for stempel (German for "stamp") and after FLAG_AUTO_SYNONYMS
> (+STEM_SOME as stemming strategy) we get the following query:
>
> [QUERY: Xapian::Query(((Zstempel:(pos=1) SYNONYM Zamtszeich:(pos=1)
> SYNONYM Zgrubenholz:(pos=1) SYNONYM Zkennzeich:(pos=1) SYNONYM
> Zpoststempel:(pos=1) SYNONYM Zpragestempel:(pos=1) SYNONYM Zpunz:(pos=1)
> SYNONYM Zsiegel:(pos=1)) FILTER QMde))]
>
> [POS: 0] [PERCENT: 100%] [WEIGHT:10.153175] [ID: 64977] [PID: 8876897]
> ...
> [POS: 38] [PERCENT: 100%] [WEIGHT:8.763471] [ID: 125701] [PID: 9023761]
> ...
>
> Term List for record #64977: QMde QMint Zpunz punzen
> Term List for record #125701: QMde QMint Zstempel Zstempelkiss Zstempeln
> stempel stempelkissen stempeln
>
>
> Perhaps there is a solution by building a special search or two searches
> (first without synonyms, second with), but the problem starts when there
> are three or more terms to search for, all with synonyms.
>
> Is there a way to prefer results with exact hits?

-- 
Best regards,
Felix Antonius Wilhelm Ostmann

From xapian at networkimprov.net Thu Feb 9 23:50:13 2012
From: xapian at networkimprov.net (Liam)
Date: Thu, 9 Feb 2012 15:50:13 -0800
Subject: [Xapian-discuss] Mime2Text library, derived from omindex
In-Reply-To:
References:
Message-ID:

On Tue, Nov 22, 2011 at 10:26 PM, Liam wrote:

> load_file() in omega/loadfile.cc (part of the pending Mime2Text lib) calls
>
> posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
>
> once, before closing the fd. In order to minimally impact the filesystem
> cache, I suspect it should call that after each read()?
>
> Also, the read buffer is only 4KB. It might be considerably more efficient
> if sized to the filesystem block size?
>

I believe doing a posix_fadvise() per read is wise, as 100MB PDFs are not
uncommon, and would pollute the filesystem cache. If, given the benchmarks
below, you'd agree, I'll commit my edits to loadfile.cc and the test program
to my github branch.

Here are benchmarks from a test program that walks a tree calling
load_file(pathname, output_string, NOCACHE | NOATIME)

Test machine is a Core 2 Duo with low-end disk, Linux kernel 2.6.32-32-generic
Note: the pattern of alternating slower/faster runs repeats over many tries

Current loadfile.cc, with 4K buffer
(buffers of 8K 16K 32K 64K showed only a 1-2s speedup)

$ time ./loadfile-test ~
total bytes read: 627344268

real 0m55.267s
user 0m0.424s
sys 0m2.504s

$ time ./loadfile-test ~
total bytes read: 627344268

real 0m18.937s
user 0m0.360s
sys 0m1.800s

------------
Moved posix_fadvise() into the read loop
(the faster pass is somewhat slower than before, though only the first is
relevant here)

$ time ./loadfile-test ~
total bytes read: 627344302

real 0m59.410s
user 0m0.532s
sys 0m2.696s

$ time ./loadfile-test ~
total bytes read: 627344302

real 0m42.393s
user 0m0.428s
sys 0m2.376s

------------
Increased the read() buffer to 32K to reduce the number of posix_fadvise() calls

$ time ./loadfile-test ~
total bytes read: 627344305

real 0m56.894s
user 0m0.472s
sys 0m2.300s

$ time ./loadfile-test ~
total bytes read: 627344305

real 0m41.719s
user 0m0.408s
sys 0m1.948s

From xapian at networkimprov.net Mon Feb 13 08:33:05 2012
From: xapian at networkimprov.net (Liam)
Date: Mon, 13 Feb 2012 00:33:05 -0800
Subject: [Xapian-discuss] Mime2Text library, derived from omindex
In-Reply-To: <20120116043439.GM1698@survex.com>
References: <20111111051906.GB1698@survex.com> <20120116043439.GM1698@survex.com>
Message-ID:

On Sun, Jan 15, 2012 at 8:34 PM, Olly Betts wrote:

>
> > The existing code expects a filename, so I feel we should stick with that
> > for the first version of this, although I agree it should take a stream
> > type eventually.
>
> For code only used in omindex, that's fine - we control the caller(s)
> and can just change how things work if needed.
>
> But if we're going to split this off into a library, we're committing to
> support the API we provide for a significant length of time, and to
> providing sane upgrade paths for any changes which later get made.
>

OK, true, but we shouldn't define the stream-oriented API until we know
exactly what a future omindex needs in that area. So I'd suggest starting
with an omindex-internal library with a pathname API. When we switch to
streams, we can move the library to its own package.

> > > Why is "command" in Mime2Text::Fields? It doesn't seem to be a field.
> >
> > It's for informational purposes, what external command produced these
> > results, if any.
>
> It isn't a field though.
>

Shall I make it a second output argument?

Liam

From olly at survex.com Thu Feb 16 05:38:06 2012
From: olly at survex.com (Olly Betts)
Date: Thu, 16 Feb 2012 05:38:06 +0000
Subject: [Xapian-discuss] GSoC 2012
Message-ID: <20120216053806.GA28623@survex.com>

Google have announced their "Summer of Code" for this year - for
background info see:

http://code.google.com/soc/

We took part last year with great success, and after a brief discussion
with those who mentored last year, we concluded it was worthwhile
applying to take part again.

I'm happy to act as admin again and submit the application.

I've updated the list of project ideas for students on the wiki from
last year, removing those tackled last year, and updating those
where work has been done outside GSoC:

http://trac.xapian.org/wiki/GSoCProjectIdeas

If you're interested in acting as a mentor for one of the ideas there,
or have an idea for a project with a scope suitable for a student to
complete in about 12 weeks, please update that page. Ideas without a
potential mentor aren't very useful though, so being willing to mentor
your new idea is helpful.

Ideas don't have to be for work on Xapian itself - projects related to
Xapian in other software are within scope. A wider range of project
ideas will give us a broader appeal to students.

Mentoring organisation applications open on 27th Feb, close on March 9th,
and are reviewed until 15th with accepted orgs announced on 16th, so
getting the ideas list into excellent shape before March 9th is the
target - a little over 3 weeks away (Google indicate a good list of
ideas is a key factor in deciding which orgs to select). You can still
add and improve ideas after that of course.

If you are a student eligible for GSoC and interested in working on
Xapian, please feel free to get in touch. You're welcome to propose
your own project idea rather than being restricted to what's on the
list.

If you want to discuss being a mentor or a student, or a project idea,
you can do so on the mailing list or on #xapian on freenode (if you
aren't already an IRC user, see http://trac.xapian.org/wiki/GSoC_IRC for
links to a web IRC client).

There's also a general GSoC IRC channel #gsoc on freenode.

Cheers,
    Olly

From xapian at networkimprov.net Thu Feb 16 06:37:16 2012
From: xapian at networkimprov.net (Liam)
Date: Wed, 15 Feb 2012 22:37:16 -0800
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To: <20120216053806.GA28623@survex.com>
References: <20120216053806.GA28623@survex.com>
Message-ID:

On Wed, Feb 15, 2012 at 9:38 PM, Olly Betts wrote:

> Google have announced their "Summer of Code" for this year - for
> background info see:
>
> http://code.google.com/soc/
>
> We took part last year with great success, and after a brief discussion
> with those who mentored last year, we concluded it was worthwhile
> applying to take part again.
>
> I'm happy to act as admin again and submit the application.
>
> I've updated the list of project ideas for students on the wiki from
> last year, removing those tackled last year, and updating those
> where work has been done outside GSoC:
>
> http://trac.xapian.org/wiki/GSoCProjectIdeas
>

Re: Text-Extraction Libraries, starting a new process isn't expensive (on
the order of 40usec for Linux, I believe), and prevents crashing the main
program. So the benefit of libraries vs apps would be saving any
extractor-specific initialization time, which I'd guess would be pretty
low. If init time is a factor for some extractors, one could rev those
programs (if source available) to accept a sequence of filenames via stdin
or other input stream.

Wouldn't handling archive files (tar, zip) be the more pressing need
in this area?

Re: Support Another Language, you might mention the Node.js binding I've
been working on? It could use a LOT more Xapian features. I'd be glad to
mentor for that.
https://github.com/networkimprov/node-xapian

Liam

From justin at redwiredesign.com Thu Feb 16 10:29:29 2012
From: justin at redwiredesign.com (Justin Finkelstein)
Date: Thu, 16 Feb 2012 10:29:29 +0000
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To: <20120216053806.GA28623@survex.com>
References: <20120216053806.GA28623@survex.com>
Message-ID: <1329388169.23831.8.camel@justin>

There are two things I'd like to suggest that aren't on the list:

     1. A queueing system, to eliminate or work around the
        one-writer-at-a-time issue
     2. A web service front-end, handling queries via GET, CRUD
        operations via POST containing XML

The idea being to bring Xapian a bit more in line with some of the other
search appliances and to make adoption easier.
I'm not sure how these would fit into the Xapian ethos, but it's
something I'd like to see developed.

From charlie at juggler.net Thu Feb 16 10:34:55 2012
From: charlie at juggler.net (Charlie Hull)
Date: Thu, 16 Feb 2012 10:34:55 +0000
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To: <1329388169.23831.8.camel@justin>
References: <20120216053806.GA28623@survex.com> <1329388169.23831.8.camel@justin>
Message-ID: <4F3CDBCF.9050405@juggler.net>

On 16/02/2012 10:29, Justin Finkelstein wrote:
> There are two things I'd like to suggest that aren't on the list:
>
> 1. A queueing system, to eliminate or work around the
> one-writer-at-a-time issue

yes, a good plan

> 2. A web service front-end, handling queries via GET, CRUD
> operations via POST containing XML

We did this a while ago although we didn't take it very far:
http://code.google.com/p/flaxcode/source/browse/#svn%2Ftrunk%2Fflax_search_service
and I know Richard has also been working on this kind of thing
subsequently.

Cheers

Charlie

>
> The idea being to bring Xapian a bit more in line with some of the other
> search appliances and to make adoption easier.
> I'm not sure how these would fit into the Xapian ethos, but it's
> something I'd like to see developed.

From andrew.betts at ft.com Fri Feb 17 12:54:36 2012
From: andrew.betts at ft.com (Andrew Betts)
Date: Fri, 17 Feb 2012 12:54:36 +0000
Subject: [Xapian-discuss] DatabaseModifiedError on get_data - best practice?
Message-ID:

Hi,

I have previously had a problem with getting this error on a get_mset
call, and solved it by subclassing XapianEnquire with a
backoff-and-retry algorithm (as suggested by this list, many thanks!).

However, I now get it intermittently when calling get_data on a
XapianDocument. The same solution doesn't seem to be quite as easy in
this case, because:

1. The document is not instantiated by my code, it's returned from the
Iterator, so I can't easily subclass it without editing the bindings.
2. The document doesn't have a reference to the database, so I can't
reopen it from that scope.

So, first, is it necessary to reopen the database in these situations,
or could I simply call get_data on the same document object after a
brief delay? And second, how/where would you suggest I insert the
retry procedure? Currently I can only see a few options, none of
which seem very good (and the first two don't solve the reopen problem):

A) Subclass XapianDocument, and in order to make the bindings use the
subclass, also subclass the iterator, matchset and enquire.
B) Hack the bindings to insert the retry into the existing
XapianDocument::get_data method.
C) Add retry at the application level (need to add to several dozen
projects!)

Any ideas much appreciated.

Cheers,

Andrew

From olly at survex.com Sun Feb 19 20:42:00 2012
From: olly at survex.com (Olly Betts)
Date: Sun, 19 Feb 2012 20:42:00 +0000
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To:
References: <20120216053806.GA28623@survex.com>
Message-ID: <20120219204200.GV17351@survex.com>

On Wed, Feb 15, 2012 at 10:37:16PM -0800, Liam wrote:
> Re: Text-Extraction Libraries, starting a new process isn't expensive (on
> the order of 40usec for Linux, I believe), and prevents crashing the main
> program. So the benefit of libraries vs apps would be saving any
> extractor-specific initialization time, which I'd guess would be pretty
> low. If init time is a factor for some extractors, one could rev those
> programs (if source available) to accept a sequence of filenames via stdin
> or other input stream.
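
[Illustration of the quoted idea: a long-lived helper that reads one
filename per line from an input stream and writes one reply per file, so any
extractor start-up cost is paid once. This is only a minimal sketch. The
names `serve` and `extract_text` and the "OK <length>" / "ERR <message>"
reply format are assumptions made for this example; they are not taken from
omindex or from any existing extractor.]

```python
import io


def extract_text(path):
    """Hypothetical extractor: simply reads the file as UTF-8 text."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()


def serve(in_stream, out_stream, extract=extract_text):
    """Handle one filename per input line; return the number of files handled.

    Each reply is either 'OK <length>' followed by the text on the next
    line(s), or 'ERR <message>' on a single line.
    """
    handled = 0
    for line in in_stream:
        path = line.rstrip("\n")
        if not path:
            continue  # ignore blank lines
        try:
            text = extract(path)
        except Exception as exc:
            # One bad file must not stop the helper from serving the next one.
            out_stream.write("ERR %s\n" % str(exc).replace("\n", " "))
        else:
            # The length (in characters) tells the caller where the text ends.
            out_stream.write("OK %d\n%s\n" % (len(text), text))
        out_stream.flush()
        handled += 1
    return handled


def demo():
    """Drive the loop with in-memory streams and a fake extractor."""
    fake_files = {"a.txt": "hello", "b.txt": "world!"}
    out = io.StringIO()
    serve(io.StringIO("a.txt\nmissing.txt\nb.txt\n"), out,
          extract=fake_files.__getitem__)
    return out.getvalue()
```

[A real helper would call something like `serve(sys.stdin, sys.stdout)` as
its entry point. Catching exceptions only covers ordinary failures; a hard
crash in the extractor would still kill the helper process, so the parent
has to notice that and start a new one.]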

If you look at the prototype patch, you'll see this is pretty much what
it already does. There's a small helper program which links to libwv2
and takes a filename on stdin and sends back the text for the title, body,
etc (which is better than we can achieve with an external extractor unless
we run a separate command for the metadata, or can get it to output HTML
which we then have to parse).

The helper program is a separate process, so we don't crash omindex if the
extractor crashes, and the helper is restarted automatically if we come
to reuse it and find it isn't running.

> Wouldn't handling archive files (tar, zip) be the more pressing need
> in this area?

I would say "more pressing" is a subjective assessment, but feel free to
add suitable project ideas to the list if you are (or have) someone
willing to mentor them. Try to write the idea up so that it is easy
to understand for a student who isn't intimately familiar with the
area already, with some "resources" for further reading and a list of
required or useful skills.

> Re: Support Another Language, you might mention the Node.js binding I've
> been working on? It could use a LOT more Xapian features. I'd be glad to
> mentor for that. https://github.com/networkimprov/node-xapian

Again, if it's a suitable scope project (I have little idea of what is
involved) and you are willing to mentor, feel free to add it to the list.

Cheers,
    Olly

From olly at survex.com Sun Feb 19 21:04:43 2012
From: olly at survex.com (Olly Betts)
Date: Sun, 19 Feb 2012 21:04:43 +0000
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To: <1329388169.23831.8.camel@justin>
References: <20120216053806.GA28623@survex.com> <1329388169.23831.8.camel@justin>
Message-ID: <20120219210443.GW17351@survex.com>

On Thu, Feb 16, 2012 at 10:29:29AM +0000, Justin Finkelstein wrote:
> There are two things I'd like to suggest that aren't on the list:
>
> 1. A queueing system, to eliminate or work around the
> one-writer-at-a-time issue
> 2. A web service front-end, handling queries via GET, CRUD
> operations via POST containing XML

Isn't (2) essentially what Richard's restpose (http://restpose.org/)
aims to do, except it's JSON not XML (which seems to be the modern
trend)?

> The idea being to bring Xapian a bit more in line with some of the other
> search appliances and to make adoption easier.
> I'm not sure how these would fit into the Xapian ethos, but it's
> something I'd like to see developed.

These seem like projects on top of Xapian to me, and that seems a
sensible separation (like how solr is a web services layer on top of
lucene).

I'm happy to include work on projects like that, but starting a new
project is potentially problematic.

If the student is engaged enough to stay involved in the longer term, it
would work OK, but if the student doesn't hang around much after GSoC
you have an orphaned project, which isn't really good for anyone
involved.

Also, most students will probably do better working within some sort of
existing structure rather than trying to start from a clean slate.

Cheers,
    Olly

From xapian at networkimprov.net Sun Feb 19 21:26:55 2012
From: xapian at networkimprov.net (Liam)
Date: Sun, 19 Feb 2012 13:26:55 -0800
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To: <20120219204200.GV17351@survex.com>
References: <20120216053806.GA28623@survex.com> <20120219204200.GV17351@survex.com>
Message-ID:

On Sun, Feb 19, 2012 at 12:42 PM, Olly Betts wrote:

> On Wed, Feb 15, 2012 at 10:37:16PM -0800, Liam wrote:
> > Wouldn't handling archive files (tar, zip) be the more pressing need
> > in this area?
>
> I would say "more pressing" is a subjective assessment, but
>

You've mentioned reading archive files in omindex before; I believe it was
your rationale for a stream-based API (vs filename) to the Mime2Text
library I've started on. So I was surprised not to see it listed.

> feel free to add suitable project ideas to the list if you are (or have)
> someone willing to mentor them. Try to write the idea up so that it is easy
> to understand for a student who isn't intimately familiar with the
> area already, with some "resources" for further reading and a list of
> required or useful skills.
>

I should just edit the wiki page?

From justin at redwiredesign.com Mon Feb 20 10:07:09 2012
From: justin at redwiredesign.com (Justin Finkelstein)
Date: Mon, 20 Feb 2012 10:07:09 +0000
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To: <20120219210443.GW17351@survex.com>
References: <20120216053806.GA28623@survex.com> <1329388169.23831.8.camel@justin> <20120219210443.GW17351@survex.com>
Message-ID: <1329732429.25053.33.camel@justin>

On Sun, 2012-02-19 at 21:04 +0000, Olly Betts wrote:
> On Thu, Feb 16, 2012 at 10:29:29AM +0000, Justin Finkelstein wrote:
> > There are two things I'd like to suggest that aren't on the list:
> >
> > 1. A queueing system, to eliminate or work around the
> > one-writer-at-a-time issue
> > 2. A web service front-end, handling queries via GET, CRUD
> > operations via POST containing XML
>
> Isn't (2) essentially what Richard's restpose (http://restpose.org/)
> aims to do, except it's JSON not XML (which seems to be the modern
> trend)?

It certainly looks like it; I'm surprised I haven't seen this before -
may I suggest a link to it from xapian.org?

> > The idea being to bring Xapian a bit more in line with some of the other
> > search appliances and to make adoption easier.
> > I'm not sure how these would fit into the Xapian ethos, but it's
> > something I'd like to see developed.
>
> These seem like projects on top of Xapian to me, and that seems a
> sensible separation (like how solr is a web services layer on top of
> lucene).

Absolutely. So we're looking for core Xapian projects.

> I'm happy to include work on projects like that, but starting a new
> project is potentially problematic.
>
> If the student is engaged enough to stay involved in the longer term, it
> would work OK, but if the student doesn't hang around much after GSoC
> you have an orphaned project, which isn't really good for anyone
> involved.
>
> Also, most students will probably do better working within some sort of
> existing structure rather than trying to start from a clean slate.

This is all understandable and I see your points on all of these things.

I wish I'd known about RestPose some time ago as it seems to make entry
into using Xapian much easier.

From james-xapian at tartarus.org Mon Feb 20 11:44:24 2012
From: james-xapian at tartarus.org (James Aylett)
Date: Mon, 20 Feb 2012 11:44:24 +0000
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To: <1329732429.25053.33.camel@justin>
References: <20120216053806.GA28623@survex.com> <1329388169.23831.8.camel@justin> <20120219210443.GW17351@survex.com> <1329732429.25053.33.camel@justin>
Message-ID:

On 20 Feb 2012, at 10:07, Justin Finkelstein wrote:

> I wish I'd known about RestPose some time ago as it seems to make entry
> into using Xapian much easier.

It hasn't really been around all that long :-)

J

-- 
 James Aylett
 talktorex.co.uk - xapian.org - devfort.com

From justin at redwiredesign.com Mon Feb 20 11:48:15 2012
From: justin at redwiredesign.com (Justin Finkelstein)
Date: Mon, 20 Feb 2012 11:48:15 +0000
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To:
References: <20120216053806.GA28623@survex.com> <1329388169.23831.8.camel@justin> <20120219210443.GW17351@survex.com> <1329732429.25053.33.camel@justin>
Message-ID: <1329738495.32061.1.camel@justin>

On Mon, 2012-02-20 at 11:44 +0000, James Aylett wrote:
> On 20 Feb 2012, at 10:07, Justin Finkelstein wrote:
>
> > I wish I'd known about RestPose some time ago as it seems to make entry
> > into using Xapian much easier.
>
> It hasn't really been around all that long :-)
>
> J

Aha - what's its stability like, James?

From james-xapian at tartarus.org Mon Feb 20 14:25:35 2012
From: james-xapian at tartarus.org (James Aylett)
Date: Mon, 20 Feb 2012 14:25:35 +0000
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To: <1329738495.32061.1.camel@justin>
References: <20120216053806.GA28623@survex.com> <1329388169.23831.8.camel@justin> <20120219210443.GW17351@survex.com> <1329732429.25053.33.camel@justin> <1329738495.32061.1.camel@justin>
Message-ID: <13CE5A67-9F58-4655-8B5D-0B4BF4EAC147@tartarus.org>

On 20 Feb 2012, at 11:48, Justin Finkelstein wrote:

>> > I wish I'd known about RestPose some time ago as it seems to make entry
>> > into using Xapian much easier.
> Aha - what's its stability like, James?

I'm not using it in production yet, but I haven't noticed any stability
problems.

J

-- 
 James Aylett
 talktorex.co.uk - xapian.org - devfort.com

From olly at survex.com Mon Feb 20 20:27:35 2012
From: olly at survex.com (Olly Betts)
Date: Mon, 20 Feb 2012 20:27:35 +0000
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To: <1329732429.25053.33.camel@justin>
References: <20120216053806.GA28623@survex.com> <1329388169.23831.8.camel@justin> <20120219210443.GW17351@survex.com> <1329732429.25053.33.camel@justin>
Message-ID: <20120220202735.GA17351@survex.com>

On Mon, Feb 20, 2012 at 10:07:09AM +0000, Justin Finkelstein wrote:
> On Sun, 2012-02-19 at 21:04 +0000, Olly Betts wrote:
>
> > On Thu, Feb 16, 2012 at 10:29:29AM +0000, Justin Finkelstein wrote:
> > > 2. A web service front-end, handling queries via GET, CRUD
> > > operations via POST containing XML
> >
> > Isn't (2) essentially what Richard's restpose (http://restpose.org/)
> > aims to do, except it's JSON not XML (which seems to be the modern
> > trend)?
>
> It certainly looks like it; I'm surprised I haven't seen this before -
> may I suggest a link to it from xapian.org?

Richard is able to edit the website, so can just add one if/when he
thinks it is appropriate to publicise more widely.

> > These seem like projects on top of Xapian to me, and that seems a
> > sensible separation (like how solr is a web services layer on top of
> > lucene).
>
> Absolutely. So we're looking for core Xapian projects.

Well, no - core projects are certainly acceptable, but projects working
on things built on Xapian are also in scope.

But if the project is essentially starting something entirely new, then
I think we need to carefully consider its viability after the Summer.
If a student works on a project adding something to an existing codebase
which already has people maintaining and developing it, then the student
not staying involved isn't such an issue.

Cheers,
    Olly

From xapian at networkimprov.net Wed Feb 22 19:11:49 2012
From: xapian at networkimprov.net (Liam)
Date: Wed, 22 Feb 2012 11:11:49 -0800
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To:
References: <20120216053806.GA28623@survex.com> <20120219204200.GV17351@survex.com>
Message-ID:

On Sun, Feb 19, 2012 at 12:42 PM, Olly Betts wrote:

> feel free to add suitable project ideas to the list if you are (or have)
> someone willing to mentor them. Try to write the idea up so that it is easy
> to understand for a student who isn't intimately familiar with the
> area already, with some "resources" for further reading and a list of
> required or useful skills.
>

I should just edit the wiki page?

From olly at survex.com Thu Feb 23 02:25:26 2012
From: olly at survex.com (Olly Betts)
Date: Thu, 23 Feb 2012 02:25:26 +0000
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To:
References: <20120216053806.GA28623@survex.com> <20120219204200.GV17351@survex.com>
Message-ID: <20120223022526.GD17351@survex.com>

On Sun, Feb 19, 2012 at 01:26:55PM -0800, Liam wrote:
> You've mentioned reading archive files in omindex before, I believe it was
> your rationale for a stream-based API (vs filename) to the Mime2Text
> library I've started on. So I was surprised not to see it listed.

It's not intended to be an exhaustive list of every idea someone would
like to see implemented. The purpose of the list is to attract students
to work with us (and as a first step for Google to select us as an org
because they think students will want to work with us).

So we really want a sensible length list of varied project ideas which
will appeal to students and will hopefully take them an appropriate
amount of work to implement.

And it's better if the ideas don't involve extensive changes to the
same parts of the code as each other, as that means a painful merge
if both projects get done. So we probably don't want one student
working on indexing container formats while another works on using
extraction libraries. Perhaps one student could work on a project
covering both though.

We also need at least one person willing to mentor each project idea
(otherwise it's pointless to offer it), and ideally we don't want the
same person as the only potential mentor for many ideas, as it means
we can only really select one application among those ideas.

> I should just edit the wiki page?

Yes, or else post a draft here if you prefer.

Cheers,
    Olly

From peter at peknet.com Fri Feb 24 03:41:31 2012
From: peter at peknet.com (Peter Karman)
Date: Thu, 23 Feb 2012 21:41:31 -0600
Subject: [Xapian-discuss] Dezi
Message-ID: <4F4706EB.8020608@peknet.com>

The recent thread about http://restpose.org/ (which looks like a cool
project) reminded me that I had not yet announced the existence of
Dezi[0] to this group.

Dezi is a search server similar to RestPose and Apache Solr that supports
OpenSearch[1] XML and JSON response types. You can index any format of
document supported by SWISH::Filter[3] via a REST API. Clients are
available already in Perl and PHP.

The Xapian backend[2] is available via
Search::OpenSearch::Engine::Xapian, which relies on the Search::Xapian
Perl bindings.

Contributors, users, critics all welcome.

cheers,
pek

[0] http://dezi.org/
[1] http://www.opensearch.org/
[2] http://dezi.org/node/4
[3] https://metacpan.org/module/SWISH::Filter

-- 
Peter Karman . http://peknet.com/ . peter at peknet.com

From xapian at networkimprov.net Fri Feb 24 08:35:10 2012
From: xapian at networkimprov.net (Liam)
Date: Fri, 24 Feb 2012 00:35:10 -0800
Subject: [Xapian-discuss] GSoC 2012
In-Reply-To: <20120223022526.GD17351@survex.com>
References: <20120216053806.GA28623@survex.com> <20120219204200.GV17351@survex.com> <20120223022526.GD17351@survex.com>
Message-ID:

On Wed, Feb 22, 2012 at 6:25 PM, Olly Betts wrote:

> And it's better if the ideas don't involve extensive changes to the
> same parts of the code as each other, as that means a painful merge
> if both projects get done. So we probably don't want one student
> working on indexing container formats while another works on using
> extraction libraries. Perhaps one student could work on a project
> covering both though.
>

The Text Extraction Libraries project does of course overlap with the
Mime2Text library I've drafted. I'm not clear how much the former would
directly change omindex.cc. If significantly, I'd suggest that this project
be done as a branch of Mime2Text instead of on a new branch. I'm happy to
support that.

https://github.com/networkimprov/xapian/commits/liam_mime2text-lib
(BTW, it'd be great if you could take a deeper look at this sometime. I
posted my draft back in November...)

As for the Node.js binding, I'd suggest you add this line to the Support
Another Language project:

A basic Node.js binding exists but lacks many Xapian features. Extending it
requires learning the V8 & Node plugin APIs.
https://github.com/networkimprov/node-xapian

Liam