<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Django query set iterator &#8211; for really large, querysets</title>
	<atom:link href="http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/</link>
	<description>Blogging the world of IT and Business</description>
	<lastBuildDate>Sun, 29 Jan 2012 20:03:56 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Danilo</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-8574</link>
		<dc:creator>Danilo</dc:creator>
		<pubDate>Fri, 01 Apr 2011 08:43:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-8574</guid>
		<description>I added another iterator function in order to return a list of size n containing query result rows, at the same time making use of the original queryset_iterator.

Feel free to improve it on gist: https://gist.github.com/897894</description>
		<content:encoded><![CDATA[<p>I added another iterator function in order to return a list of size n containing query result rows, at the same time making use of the original queryset_iterator.</p>
<p>Feel free to improve it on gist: <a href="https://gist.github.com/897894" rel="nofollow">https://gist.github.com/897894</a></p>
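<p>For readers who cannot open the gist, here is a minimal sketch of the batching idea. It is not the gist's code: <code>batch_iterator</code> is a hypothetical name, and a plain <code>range</code> stands in for the rows that queryset_iterator would yield. The final list may hold fewer than n rows.</p>

```python
from itertools import islice


def batch_iterator(rows, n):
    """Yield lists of up to n items drawn from any iterable of rows."""
    if n < 1:
        raise ValueError("n must be at least 1")
    iterator = iter(rows)
    while True:
        # islice pulls at most n items without consuming the rest.
        batch = list(islice(iterator, n))
        if not batch:
            return
        yield batch


# Stand-in for queryset_iterator(queryset); any iterable of rows works here.
rows = range(1, 8)
batches = list(batch_iterator(rows, 3))
print(batches)
```

<p>Because the wrapper only consumes an iterator, at most one batch is held in memory at a time on top of whatever the underlying iterator buffers.</p>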
]]></content:encoded>
	</item>
	<item>
		<title>By: Danilo</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-8573</link>
		<dc:creator>Danilo</dc:creator>
		<pubDate>Fri, 01 Apr 2011 08:10:59 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-8573</guid>
		<description>Aron: Your improvement with the first PK is good, but the pk increment using `pk += chunksize` only works if you have consecutive primary keys.</description>
		<content:encoded><![CDATA[<p>Aron: Your improvement with the first PK is good, but the pk increment using `pk += chunksize` only works if you have consecutive primary keys.</p>
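<p>A small simulation of the `pk += chunksize` concern, under a loudly stated assumption: each chunk is fetched as "the next chunksize rows with a primary key greater than pk", which may not match the pastebin code. All function names are hypothetical, and a Python list stands in for the table.</p>

```python
def fetch_all(pks, chunksize, advance):
    """Simulate repeated `pk > last ORDER BY pk LIMIT chunksize` queries."""
    ordered = sorted(pks)
    if not ordered:
        return []
    last, seen = 0, []  # assumes positive integer primary keys
    while last < ordered[-1]:
        chunk = [pk for pk in ordered if pk > last][:chunksize]
        seen.extend(chunk)
        last = advance(last, chunk, chunksize)
    return seen


def by_chunksize(last, chunk, chunksize):
    return last + chunksize  # the `pk += chunksize` style


def by_last_row(last, chunk, chunksize):
    return chunk[-1]  # resume after the last key actually returned


sparse = [1, 2, 10, 11, 12, 30]  # primary keys with gaps, e.g. after deletes
incremented = fetch_all(sparse, 2, by_chunksize)
tracked = fetch_all(sparse, 2, by_last_row)
dense = fetch_all([1, 2, 3, 4, 5, 6], 2, by_chunksize)
```

<p>Under that assumption, the fixed increment returns some rows more than once when the keys have gaps, while resuming from the last key seen returns each row exactly once. With consecutive keys both approaches agree.</p>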
]]></content:encoded>
	</item>
	<item>
		<title>By: Aron Griffis</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-8536</link>
		<dc:creator>Aron Griffis</dc:creator>
		<pubDate>Tue, 11 Jan 2011 21:56:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-8536</guid>
		<description>Thanks, Thierry. This is exactly what I need in my application. I tweaked it slightly as you can see here: http://pastebin.com/6ukwatVs (setting pk to the first primary key in the queryset rather than 0, and incrementing pk once rather than for each row).</description>
		<content:encoded><![CDATA[<p>Thanks, Thierry. This is exactly what I need in my application. I tweaked it slightly as you can see here: <a href="http://pastebin.com/6ukwatVs" rel="nofollow">http://pastebin.com/6ukwatVs</a> (setting pk to the first primary key in the queryset rather than 0, and incrementing pk once rather than for each row).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Very Large Result Sets in Django using PostgreSQL</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-8513</link>
		<dc:creator>Very Large Result Sets in Django using PostgreSQL</dc:creator>
		<pubDate>Tue, 14 Dec 2010 03:10:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-8513</guid>
		<description>[...] also an example here of constructing an iterator that does much the same [...]</description>
		<content:encoded><![CDATA[<p>[...] also an example here of constructing an iterator that does much the same [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Christophe Pettus</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-8512</link>
		<dc:creator>Christophe Pettus</dc:creator>
		<pubDate>Tue, 14 Dec 2010 02:31:07 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-8512</guid>
		<description>Ah, never mind: Psycopg2 grabs the whole result set even if you are doing .fetchmany().</description>
		<content:encoded><![CDATA[<p>Ah, never mind: Psycopg2 grabs the whole result set even if you are doing .fetchmany().</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Christophe Pettus</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-8509</link>
		<dc:creator>Christophe Pettus</dc:creator>
		<pubDate>Sun, 12 Dec 2010 03:17:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-8509</guid>
		<description>&lt;blockquote&gt;Adding .iterator to your query set helps somewhat, but still loads the entire query result into memory.&lt;/blockquote&gt;

Hm, that&#039;s interesting. Looking at the Django code (1.2.3), it appears that it is not loading the whole result set into memory. Instead, it&#039;s calling .fetchmany for a series of chunks, each chunk being hard-coded to 100 rows. Is your experience different, or could the memory usage be coming from elsewhere?</description>
		<content:encoded><![CDATA[<blockquote><p>Adding .iterator to your query set helps somewhat, but still loads the entire query result into memory.</p></blockquote>
<p>Hm, that&#8217;s interesting. Looking at the Django code (1.2.3), it appears that it is not loading the whole result set into memory. Instead, it&#8217;s calling .fetchmany for a series of chunks, each chunk being hard-coded to 100 rows. Is your experience different, or could the memory usage be coming from elsewhere?</p>
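<p>The chunked pattern described here looks roughly like the sketch below. This is not Django's source: it is the generic DB-API idiom, shown against an in-memory SQLite table so that it runs anywhere. <code>iter_rows</code> and the <code>item</code> table are hypothetical names. Whether memory actually stays flat depends on how the database driver buffers results.</p>

```python
import sqlite3


def iter_rows(cursor, chunk_size=100):
    """Yield rows one at a time while fetching them chunk_size at a time."""
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            return
        for row in chunk:
            yield row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO item (name) VALUES (?)",
    [("row %d" % i,) for i in range(250)],
)
cursor = conn.execute("SELECT id, name FROM item ORDER BY id")
rows = list(iter_rows(cursor))
print(len(rows))
```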
]]></content:encoded>
	</item>
	<item>
		<title>By: Django Memory Error &#8211; How-to work with large databases &#171; Harbinger&#39;s Hollow</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-8160</link>
		<dc:creator>Django Memory Error &#8211; How-to work with large databases &#171; Harbinger&#39;s Hollow</dc:creator>
		<pubDate>Fri, 25 Jun 2010 20:13:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-8160</guid>
		<description>[...] the  Memory efficient Django Queryset Iterator, written by Thierry [...]</description>
		<content:encoded><![CDATA[<p>[...] the  Memory efficient Django Queryset Iterator, written by Thierry [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Django query set iterator – for really large, querysets</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-7929</link>
		<dc:creator>Django query set iterator – for really large, querysets</dc:creator>
		<pubDate>Sat, 06 Mar 2010 17:08:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-7929</guid>
		<description>[...] solution for dealing with very large querysets in django when memory is a limiting constraint, with some nice discussion in the comments about why limit and [...]</description>
		<content:encoded><![CDATA[<p>[...] solution for dealing with very large querysets in django when memory is a limiting constraint, with some nice discussion in the comments about why limit and [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: rick</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-7927</link>
		<dc:creator>rick</dc:creator>
		<pubDate>Fri, 05 Mar 2010 01:51:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-7927</guid>
		<description>@Henrique: even with UUIDs this would still work. It works as long as you have some unique, sortable identifier. UUIDs are unique and can be ordered, so the method still works.

The problem here was that the database client (Psycopg2) was putting all (or at least way too many) of the rows in memory, so the application ran out of memory. For the record, we&#039;re using Postgres 8.4.

@Oleg: using a limit is nice on the client side, but the server doesn&#039;t like it. When you&#039;re using LIMIT/OFFSET queries, the server has to fetch every row up to the end of the requested range and return only the requested ones.

So when doing `LIMIT 1 OFFSET 1000000` the server will fetch 1000001 rows and discard 1000000 of them. On a large table with a large offset this is a very slow process. That is also the reason that Google only shows the first 1000 results ;)

With this iterator your database is able to use the index on the primary key so it&#039;s always fast to fetch the requested rows.</description>
		<content:encoded><![CDATA[<p>@Henrique: even with UUIDs this would still work. It works as long as you have some unique, sortable identifier. UUIDs are unique and can be ordered, so the method still works.</p>
<p>The problem here was that the database client (Psycopg2) was putting all (or at least way too many) of the rows in memory, so the application ran out of memory. For the record, we&#8217;re using Postgres 8.4.</p>
<p>@Oleg: using a limit is nice on the client side, but the server doesn&#8217;t like it. When you&#8217;re using LIMIT/OFFSET queries, the server has to fetch every row up to the end of the requested range and return only the requested ones.</p>
<p>So when doing `LIMIT 1 OFFSET 1000000` the server will fetch 1000001 rows and discard 1000000 of them. On a large table with a large offset this is a very slow process. That is also the reason that Google only shows the first 1000 results ;)</p>
<p>With this iterator your database is able to use the index on the primary key so it&#8217;s always fast to fetch the requested rows.</p>
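<p>A minimal sketch of that primary-key approach, written against an in-memory SQLite table rather than the Django ORM so that it runs standalone. <code>keyset_iterator</code> and the <code>item</code> table are hypothetical names, and the sketch assumes positive integer keys. Each chunk is a range query on the primary key, so there is no OFFSET for the server to skip past.</p>

```python
import sqlite3


def keyset_iterator(conn, chunk_size=1000):
    """Yield rows in primary-key order, one indexed range query per chunk."""
    last_id = 0  # assumes positive integer keys
    while True:
        chunk = conn.execute(
            "SELECT id, name FROM item WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not chunk:
            return
        for row in chunk:
            yield row
        last_id = chunk[-1][0]  # resume after the last key actually seen


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO item (id, name) VALUES (?, ?)",
    [(i, "row %d" % i) for i in range(1, 30, 3)],  # keys with gaps
)
ids = [row[0] for row in keyset_iterator(conn, chunk_size=4)]
print(ids)
```

<p>The same loop works for any unique key the database can order, as long as the starting value sorts before every real key.</p>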
]]></content:encoded>
	</item>
	<item>
		<title>By: Thierry</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-7924</link>
		<dc:creator>Thierry</dc:creator>
		<pubDate>Thu, 04 Mar 2010 11:29:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-7924</guid>
		<description>Well the PK based method above is quite similar to using slicing.
Slicing however gives a few problems.

Something like [100000:200000] is not very nice on your DB. The PK method is less heavy.
If you use slicing though, be sure to iterate from the back of the list to the beginning. Otherwise you&#039;ll have problems when items are being removed from the list by the process you use it for.</description>
		<content:encoded><![CDATA[<p>Well the PK based method above is quite similar to using slicing.<br />
Slicing however gives a few problems.</p>
<p>Something like [100000:200000] is not very nice on your DB. The PK method is less heavy.<br />
If you use slicing though, be sure to iterate from the back of the list to the beginning. Otherwise you&#8217;ll have problems when items are being removed from the list by the process you use it for.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Oleg</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-7923</link>
		<dc:creator>Oleg</dc:creator>
		<pubDate>Thu, 04 Mar 2010 11:26:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-7923</guid>
		<description>Why not use slicing on the given queryset? Something like qs[from:to] will add a LIMIT/OFFSET clause to your query.</description>
		<content:encoded><![CDATA[<p>Why not use slicing on the given queryset? Something like qs[from:to] will add a LIMIT/OFFSET clause to your query.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Thierry</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-7922</link>
		<dc:creator>Thierry</dc:creator>
		<pubDate>Thu, 04 Mar 2010 11:24:03 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-7922</guid>
		<description>Hey Henrique, thanks for the input. As far as I can see, it&#039;s the actual reading of the result set from the DB into Python&#039;s memory that causes the problem. So it&#039;s the raw result from the DB, not actually Django&#039;s ORM layer.

We would probably need to adjust some things in the cursor handling within Django&#039;s ORM to implement a fix like you suggest...</description>
		<content:encoded><![CDATA[<p>Hey Henrique, thanks for the input. As far as I can see, it&#8217;s the actual reading of the result set from the DB into Python&#8217;s memory that causes the problem. So it&#8217;s the raw result from the DB, not actually Django&#8217;s ORM layer.</p>
<p>We would probably need to adjust some things in the cursor handling within Django&#8217;s ORM to implement a fix like you suggest&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Henrique</title>
		<link>http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/comment-page-1/#comment-7921</link>
		<dc:creator>Henrique</dc:creator>
		<pubDate>Thu, 04 Mar 2010 02:44:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.mellowmorning.com/?p=135#comment-7921</guid>
		<description>Django&#039;s iterator() helps avoid the internal cache, but I think you&#039;re hitting a limit in either the database adapter or the database itself. It could benefit from some DB tuning. Are you using MySQL?

Now, the problem with the snippet is that it naturally won&#039;t work for non-auto PKs - UUIDs for instance. As such, it won&#039;t work with really big or sharded databases that can&#039;t use auto PKs. Those are the ones that would benefit the most from burst loading result sets.

If the problem is simply memory usage and is limited to batch tasks, maybe streaming the result set to disk and consuming it from a queue later on could help.</description>
		<content:encoded><![CDATA[<p>Django&#8217;s iterator() helps avoid the internal cache, but I think you&#8217;re hitting a limit in either the database adapter or the database itself. It could benefit from some DB tuning. Are you using MySQL?</p>
<p>Now, the problem with the snippet is that it naturally won&#8217;t work for non-auto PKs &#8211; UUIDs for instance. As such, it won&#8217;t work with really big or sharded databases that can&#8217;t use auto PKs. Those are the ones that would benefit the most from burst loading result sets.</p>
<p>If the problem is simply memory usage and is limited to batch tasks, maybe streaming the result set to disk and consuming it from a queue later on could help.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

