When you try to iterate over a query set with about 0.5 million items (a few hundred megs of db storage), the memory usage can become somewhat problematic. Adding .iterator to your query set helps somewhat, but still loads the entire query result into memory. Cronjobs at YouTellMe.nl where unfortunately starting to fail. My colleague Rick came up with the following fix.
This solution chunks up the querying in bits of 1000 (by default). While this is somewhat heavier on your database (multiple queries) it seriously reduces the memory usage. Curious to hear how other django developers have worked around this problem.
import gc
def queryset_iterator(queryset, chunksize=1000):
'''
Iterate over a Django Queryset ordered by the primary key
This method loads a maximum of chunksize (default: 1000) rows in it's
memory at the same time while django normally would load all rows in it's
memory. Using the iterator() method only causes it to not preload all the
classes.
Note that the implementation of the iterator does not support ordered query sets.
'''
pk = 0
last_pk = queryset.order_by('-pk')[0].pk
queryset = queryset.order_by('pk')
while pk < last_pk:
for row in queryset.filter(pk__gt=pk)[:chunksize]:
pk = row.pk
yield row
gc.collect()
#Some Examples:
#old
MyItem.objects.all()
#better
MyItem.objects.all().iterator()
#even better
queryset_iterator(MyItem.objects.all())
Henrique responded on 04 Mar 2010 at 2:44 am #
Django’s iterator() helps on avoiding the internal cache, but I think you’re hitting a limit on either the database adaptor on the database itself. Could benefit from some DB tuning. Are you using MySQL?
Now, the problem with the snippet is that it naturally won’t work for non-auto PKs – UUIDs for instance. As such, it means it won’t work with really big or sharded databases, that can’t use auto PKs. Those are the ones that would benefit the most from burst loading result sets.
If the problem is simply memory usage and is limited to batch tasks, maybe streaming the result set to disk and consuming on a queue later on could help.
Thierry responded on 04 Mar 2010 at 11:24 am #
Hey Henrique, thanks for the input. As far as I can see its the actual reading of the resultset from the db into python’s memory which causes the problem. So the raw result from the DB, not actually Django’s orm layer.
Would probably need to adjust some things in the cursor handling within django’s orm to implement a fix like you suggest…
Oleg responded on 04 Mar 2010 at 11:26 am #
Why not to use limit function on given queryset. Like qs[from:to] it will add LIMIT OFFSET construction to your query.
Thierry responded on 04 Mar 2010 at 11:29 am #
Well the PK based method above is quite similar to using slicing.
Slicing however gives a few problems.
Something like [100000:200000] is not very nice on your DB. The PK method is less heavy.
If you use slicing though, be sure to iterate from the back of the list to the beginning. Otherwise you’ll have problems when items are being removed from the list by the process you use it for.
rick responded on 05 Mar 2010 at 1:51 am #
@Henrique: even with UUIDs this would still work, it works as long as you have some unique sortable identifier. Since UUIDs are unique they are by definition sortable so the method still works flawlessly.
The problem here was that the database client (Psycopg2) was putting all (or atleast, way too much) of the rows in memory so the application ran out of memory. For the record, we’re using Postgres 8.4.
@Oleg: using a limit is nice on the clientside, but the server doesn’t like it. When you’re using LIMIT/OFFSET queries the server has to get all the results and only return the requested ones.
So when doing `LIMIT 1 OFFSET 1000000` the server will fetch 1000001 rows and discard 1000000 of them. On a large table with a large offset this is a very slow process. That is also the reason that Google only shows the first 1000 results ;)
With this iterator your database is able to use the index on the primary key so it’s always fast to fetch the requested rows.
Django query set iterator – for really large, querysets responded on 06 Mar 2010 at 5:08 pm #
[...] solution for dealing with very large querysets in django when memory is a limiting constraint, with some nice discussion in the comments about why limit and [...]
Django Memory Error – How-to work with large databases « Harbinger's Hollow responded on 25 Jun 2010 at 8:13 pm #
[...] the Memory efficient Django Queryset Iterator, written by Thierry [...]
Christophe Pettus responded on 12 Dec 2010 at 3:17 am #
Hm, that’s interesting. Looking at the Django code (1.2.3), it appears that it is not loading the whole result set into the memory. Instead, it’s calling .fetchmany for a series of chunks, each chunk being hard-coded to be 100 rows. Is your experience different, or could the memory usage be coming from elsewhere?
Christophe Pettus responded on 14 Dec 2010 at 2:31 am #
Ah, never mind: Psycopg2 grabs the whole result set even if you are doing .fetchmany().
Very Large Result Sets in Django using PostgreSQL responded on 14 Dec 2010 at 3:10 am #
[...] also an example here of constructing an iterator that does much the same [...]
Aron Griffis responded on 11 Jan 2011 at 9:56 pm #
Thanks, Thierry. This is exactly what I need in my application. I tweaked it slightly as you can see here: http://pastebin.com/6ukwatVs (setting pk to the first primary key in the queryset rather than 0, and incrementing pk once rather than for each row).
Danilo responded on 01 Apr 2011 at 8:10 am #
Aaron: Your improvement with the first PK is good, but the pk increment using `pk += chunksize` only works if you have consecutive primary keys.
Danilo responded on 01 Apr 2011 at 8:43 am #
I added another iterator function in order to return a list of size n containing query result rows, at the same time making use of the original queryset_iterator.
Feel free to improve it on gist: https://gist.github.com/897894
Andrew responded on 21 May 2012 at 1:55 pm #
try the code below. I’ve used queryset_iterator above for many large migrations of up to a million entries (many with join tables and the like) and maintaining a database cursor with only the primary key selected (to reduce temporary table sizes) and processing in batches, as done below, is much more efficient (especially when indexing isn’t optimal). give it a go:
def queryset_iterator(queryset, batchsize = 500, gc_collect = True):
iterator = queryset.values_list(‘pk’, flat = True).distinct().iterator()
eof = False
while not eof:
primary_key_buffer = []
try:
while len(primary_key_buffer) < batchsize:
primary_key_buffer.append(iterator.next())
except StopIteration:
eof = True
for result in queryset.filter(pk__in = primary_key_buffer).iterator():
yield result
if gc_collect:
gc.collect()
Glen Robertson responded on 05 Jun 2012 at 11:12 pm #
Thanks a lot. This helped me!
(:
Memory efficient (constant) and speed optimized iteration over a large table in Django responded on 07 Jan 2013 at 11:24 am #
[...] better solution is available in this post by Thierry Schellenbach. It filters on the PK, which is way faster than offsetting (how fast [...]