When you try to iterate over a query set with about 0.5 million items (a few hundred megs of db storage), the memory usage can become somewhat problematic. Adding .iterator to your query set helps somewhat, but still loads the entire query result into memory. Cronjobs at YouTellMe.nl where unfortunately starting to fail. My colleague Rick came up with the following fix.
This solution chunks up the querying in bits of 1000 (by default). While this is somewhat heavier on your database (multiple queries) it seriously reduces the memory usage. Curious to hear how other django developers have worked around this problem.
import gc
def queryset_iterator(queryset, chunksize=1000):
'''
Iterate over a Django Queryset ordered by the primary key
This method loads a maximum of chunksize (default: 1000) rows in it's
memory at the same time while django normally would load all rows in it's
memory. Using the iterator() method only causes it to not preload all the
classes.
Note that the implementation of the iterator does not support ordered query sets.
'''
pk = 0
last_pk = queryset.order_by('-pk')[0].pk
queryset = queryset.order_by('pk')
while pk < last_pk:
for row in queryset.filter(pk__gt=pk)[:chunksize]:
pk = row.pk
yield row
gc.collect()
#Some Examples:
#old
MyItem.objects.all()
#better
MyItem.objects.all().iterator()
#even better
queryset_iterator(MyItem.objects.all())

Henrique responded on 04 Mar 2010 at 2:44 am #
Django’s iterator() helps on avoiding the internal cache, but I think you’re hitting a limit on either the database adaptor on the database itself. Could benefit from some DB tuning. Are you using MySQL?
Now, the problem with the snippet is that it naturally won’t work for non-auto PKs – UUIDs for instance. As such, it means it won’t work with really big or sharded databases, that can’t use auto PKs. Those are the ones that would benefit the most from burst loading result sets.
If the problem is simply memory usage and is limited to batch tasks, maybe streaming the result set to disk and consuming on a queue later on could help.
Thierry responded on 04 Mar 2010 at 11:24 am #
Hey Henrique, thanks for the input. As far as I can see its the actual reading of the resultset from the db into python’s memory which causes the problem. So the raw result from the DB, not actually Django’s orm layer.
Would probably need to adjust some things in the cursor handling within django’s orm to implement a fix like you suggest…
Oleg responded on 04 Mar 2010 at 11:26 am #
Why not to use limit function on given queryset. Like qs[from:to] it will add LIMIT OFFSET construction to your query.
Thierry responded on 04 Mar 2010 at 11:29 am #
Well the PK based method above is quite similar to using slicing.
Slicing however gives a few problems.
Something like [100000:200000] is not very nice on your DB. The PK method is less heavy.
If you use slicing though, be sure to iterate from the back of the list to the beginning. Otherwise you’ll have problems when items are being removed from the list by the process you use it for.
rick responded on 05 Mar 2010 at 1:51 am #
@Henrique: even with UUIDs this would still work, it works as long as you have some unique sortable identifier. Since UUIDs are unique they are by definition sortable so the method still works flawlessly.
The problem here was that the database client (Psycopg2) was putting all (or atleast, way too much) of the rows in memory so the application ran out of memory. For the record, we’re using Postgres 8.4.
@Oleg: using a limit is nice on the clientside, but the server doesn’t like it. When you’re using LIMIT/OFFSET queries the server has to get all the results and only return the requested ones.
So when doing `LIMIT 1 OFFSET 1000000` the server will fetch 1000001 rows and discard 1000000 of them. On a large table with a large offset this is a very slow process. That is also the reason that Google only shows the first 1000 results ;)
With this iterator your database is able to use the index on the primary key so it’s always fast to fetch the requested rows.
Django query set iterator – for really large, querysets responded on 06 Mar 2010 at 5:08 pm #
[...] solution for dealing with very large querysets in django when memory is a limiting constraint, with some nice discussion in the comments about why limit and [...]
Django Memory Error – How-to work with large databases « Harbinger's Hollow responded on 25 Jun 2010 at 8:13 pm #
[...] the Memory efficient Django Queryset Iterator, written by Thierry [...]