I have tested my database on the temporary filesystem /dev/shm, which is comparable to a RAM disk.<br>
<br>
Performance seems a lot better ... as long as I don't use sorting!<br>
With sorting, even on this filesystem, search times can exceed 10 seconds.<br>
I am really disappointed by this issue.<br>
Does anyone else have the same needs?<br>
<br>
If it's not the hardware, then it must be the software, or perhaps the configuration.<br>
<br>
Do you know if Lucene uses the same mechanism to sort results?<br>
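To illustrate what I understand the problem to be (this is just a toy Python sketch of the idea, not Xapian code or its actual internals): with documents visited in relevance or sorted order, the engine can stop as soon as the top-k set is full, but sorting on a value stored per document forces it to read that value for every single match:<br>

```python
import heapq
import random

# Toy corpus: (docid, value) pairs; 'value' stands in for a per-document
# sort key (e.g. a date) stored in a value slot.
random.seed(0)
docs = [(docid, random.random()) for docid in range(100_000)]

def top_k_by_value(docs, k):
    """Sorting on a stored value: the value of EVERY matching
    document must be read before the top k can be known."""
    reads = 0
    heap = []  # min-heap of (value, docid), size <= k
    for docid, value in docs:
        reads += 1  # one value lookup per matching document
        if len(heap) < k:
            heapq.heappush(heap, (value, docid))
        elif value > heap[0][0]:
            heapq.heapreplace(heap, (value, docid))
    return sorted(heap, reverse=True), reads

def top_k_presorted(docs_in_value_order, k):
    """If documents could be visited in sorted order instead, the
    search could stop as soon as the top k slots are filled."""
    return docs_in_value_order[:k], k

top, reads = top_k_by_value(docs, 10)
print(reads)  # 100000 -- every match is visited, however small k is
```

So even with the whole database in RAM, the per-match value lookups dominate once the query matches many documents.<br>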
<br>
Regards<br>
<br><br><div><span class="gmail_quote">On 2/20/06, <b class="gmail_sendername">Olly Betts</b> <<a href="mailto:olly@survex.com">olly@survex.com</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
On Mon, Feb 20, 2006 at 12:05:23PM +0200, David Levy wrote:<br>> > But sorting as currently designed does need to process every matching<br>> > document, which is going to be slow for a large database if the query
<br>> > matches a lot of documents.<br>><br>> Will this mechanism change in future releases?<br><br>It's possible there's a better way to handle it. If we came up with a<br>workable scheme and somebody implemented it then we'd have a different
<br>mechanism. So it might change, but it's not something I'm currently<br>working on or actively planning to.<br><br>The problem is that you really want to process the documents in sorted<br>order, as you can then just stop once you've filled the MSet. You could
<br>list the document ids in ranked order for each sortable value (it would<br>take a fair amount of space), but then all the posting lists<br>list documents in id order, so you can't easily process documents in<br>sorted order even though you would then know that order. You could
<br>try to visit the docids in that order by random-access-like seeking<br>into posting lists. That would work OK if the top N items all made<br>it into the MSet, but at some point it'll become less efficient...<br><br>But it looks like this isn't currently the bottleneck.
<br><br>> I have compacted and removed large fields in the index. So the database is<br>> half the size ... but performance is still slow.<br>> I am thinking about using "ramdisks" maybe; and I am checking my hard disks
<br>> too.<br>> Have you used ramdisks with Xapian yet? Does it help?<br><br>The VM system in a modern Unix-like OS will cache blocks recently read<br>from disk. This dynamic caching is probably going to do as well as
<br>trying to force parts of the database into RAM. By all means give it<br>a try, but I doubt it's a magic bullet.<br><br>> > But even now, "sort by date" is still acceptably fast on 30 million<br>> > documents, which points the finger strongly towards accessing the values
<br>> > as taking most of the time.<br>><br>> What do you mean?<br>> I had bad results with < 1M documents:<br><br>I mean "sort by date" is acceptably fast *on gmane*, which doesn't use<br>
sorting on values, but still has to trawl through the whole of each<br>posting list in this case. That strongly suggests that the bottleneck<br>is currently with getting at the values to do the sorting.<br><br>> However, I used the "collapse" parameter ... Is it time-consuming even if
<br>> there are no records to collapse in the results?<br><br>Collapsing still needs to read the values, even if they are unique. So<br>if collapsing is also slow, that further points the finger at the<br>storage of the values.
<br><br>Cheers,<br> Olly<br></blockquote></div><br><br clear="all"><br>-- <br>David LEVY {selenium}<br>Website ~ <a href="http://www.davidlevy.org">http://www.davidlevy.org</a><br>Wishlist Zlio ~ <a href="http://david.zlio.com/wishlist">
http://david.zlio.com/wishlist</a><br>Blog ~ <a href="http://selenium.blogspot.com">http://selenium.blogspot.com</a><br>