130

I am going to convert a Django QuerySet to a pandas DataFrame as follows:

qs = SomeModel.objects.select_related().filter(date__year=2012)
q = qs.values('date', 'OtherField')
df = pd.DataFrame.from_records(q)

It works, but is there a more efficient way?

piRSquared
  • 285,575
  • 57
  • 475
  • 624
Franco Mariluis
  • 1,521
  • 2
  • 14
  • 15
  • Hi @FrancoMariluis, sorry about this out of topic: are you using pandas into django projects. You show graphics using "Plotting with matplotlib" via django web applications. Is a valid solution for you? Thanks. – dani herrera Jul 28 '12 at 19:09
  • Hi, for showing graphics in Django I'm using django-chartit, which works fine, but I'm thinking about using matplotlib, which would give me more flexibility – Franco Mariluis Jul 29 '12 at 00:55
  • Looks pretty straightforward, and it works. Any particular concerns? – Dmitry Shevchenko Aug 20 '12 at 05:36
  • What's wrong with the way you've got it now? Do you have a particular concern? – Burhan Khalid Aug 20 '12 at 05:46
  • This was my first (and only!) approach, but since I am fairly new to pandas, I wanted to see if there was another way, but this seems to be a good one. – Franco Mariluis Aug 21 '12 at 14:00

5 Answers5

150
import pandas as pd
import datetime
from myapp.models import BlogPost

df = pd.DataFrame(list(BlogPost.objects.all().values()))
df = pd.DataFrame(list(BlogPost.objects.filter(date__gte=datetime.datetime(2012, 5, 1)).values()))

# limit which fields
df = pd.DataFrame(list(BlogPost.objects.all().values('author', 'date', 'slug')))

The above is how I do the same thing. The most useful addition is specifying which fields you are interested in. If it's only a subset of the available fields you are interested in, then this would give a performance boost I imagine.

hobs
  • 18,473
  • 10
  • 83
  • 106
lexual
  • 46,172
  • 2
  • 15
  • 14
  • 66
    Using 'list()' seems to have been deprecated (I'm on pandas 0.12). Using `DataFrame.from_records()` works better, i.e. `df = pd.DataFrame.from_records(BlogPost.objects.all().values())`. – gregoltsov Oct 28 '13 at 01:25
  • 2
    It would be more clear if this used the names from OP question. For instance, is `BlogPost` supposed to be the same as his `SomeModel`? – Hack-R Jul 26 '17 at 18:39
  • Hi, is there a way to exclude a column you don't need in the dataframe? – Willower Nov 29 '18 at 22:33
  • Will this work on "fields" that are a `@cached_property`? – hepcat72 Jul 13 '21 at 14:07
  • Thank you, by the way it works without converting to list: `pd.DataFrame()` as of now – Ersain Jan 27 '22 at 20:48
40

Convert the queryset on values_list() will be more memory efficient than on values() directly. Since the method values() returns a queryset of list of dict (key:value pairs), values_list() only returns list of tuple (pure data). It will save about 50% memory, just need to set the column information when you call pd.DataFrame().

Method 1:
    queryset = models.xxx.objects.values("A","B","C","D")
    df = pd.DataFrame(list(queryset))  ## consumes much memory
    #df = pd.DataFrame.from_records(queryset) ## works but no much change on memory usage

Method 2:
    queryset = models.xxx.objects.values_list("A","B","C","D")
    df = pd.DataFrame(list(queryset), columns=["A","B","C","D"]) ## this will save 50% memory
    #df = pd.DataFrame.from_records(queryset, columns=["A","B","C","D"]) ##It does not work. Crashed with datatype is queryset not list.

I tested this on my project with >1 million rows data, the peak memory is reduced from 2G to 1G.

Art
  • 2,836
  • 4
  • 17
  • 34
shengyang wang
  • 401
  • 4
  • 2
  • I was testing this: if the end result is to dump it in a pandas dataframe, then you lose this efficiency. And in fact, keeping it as ".values_list()" and throwing it into pandas using ".from_records()" is much faster for the same memory – Anthony M Jul 03 '22 at 06:10
32

Django Pandas solves this rather neatly: https://github.com/chrisdev/django-pandas/

From the README:

class MyModel(models.Model):
    full_name = models.CharField(max_length=25)
    age = models.IntegerField()
    department = models.CharField(max_length=3)
    wage = models.FloatField()

from django_pandas.io import read_frame
qs = MyModel.objects.all()
df = read_frame(qs)
David Watson
  • 3,394
  • 2
  • 36
  • 51
  • 15
    How does Django Pandas deal with large datasets? https://github.com/chrisdev/django-pandas/blob/master/django_pandas/io.py#L107 This line scares me, because I think it means the whole dataset will be loaded into memory at once. – Adam Barnes Nov 04 '16 at 11:06
  • @Ada To create a DataFrame using specified field names:`df = read_frame(qs, fieldnames=['age', 'wage', 'full_name'])` – Gathide May 04 '20 at 04:26
  • 2
    For those of you in this wonderous future who are wondering wth I was on about, here's a more permanent link to the source at the time: https://github.com/chrisdev/django-pandas/blob/bce1d59a552c81c91fb8aea7f522fde4a83d0159/django_pandas/io.py#L107 – Adam Barnes May 06 '20 at 00:15
  • django pandas able to handle `many-to-many` fields. – Gathide Jul 13 '21 at 10:11
2

From the Django perspective (I'm not familiar with pandas) this is fine. My only concern is that if you have a very large number of records, you may run into memory problems. If this were the case, something along the lines of this memory efficient queryset iterator would be necessary. (The snippet as written might require some rewriting to allow for your smart use of .values()).

David Eyk
  • 12,171
  • 11
  • 63
  • 103
  • @GregoryGoltsov's idea to use `.from_records()` and not using `list()` will eliminate the memory efficiency concern. – hobs Dec 16 '14 at 22:29
  • 1
    The memory efficiency concern is on the Django side. [`.values()`](https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.values) returns a `ValuesQuerySet` which caches results, so for a large enough dataset, it's going to be quite memory-intensive. – David Eyk Dec 17 '14 at 22:41
  • 1
    Ahh yes. You'd have to index into the queryset *and* use `.from_records` without the list comprehension to eliminate both memory hogs. e.g. `pd.DataFrame.from_records(qs[i].__dict__ for i in range(qs.count()))`. But you're left with that annoying `"_state"` column when you're done. `qs.values()[i]` is much faster and cleaner, but I think it caches. – hobs Dec 17 '14 at 23:02
2

You maybe can use model_to_dict

import datetime
from django.forms import model_to_dict
pallobjs = [ model_to_dict(pallobj) for pallobj in PalletsManag.objects.filter(estado='APTO_PARA_VENTA')] 
df = pd.DataFrame(pallobjs)
df.head()
Pjl
  • 1,752
  • 18
  • 21