
I'm trying to provide initial data using two sets of fixtures. The first fixture's format looks like this:

  {
    "pk": 1,
    "model": "data.Person",
    "fields": {
      "full": "Anna-Varney",
      "num": "I",
      "short": "Anna-Varney"
    }
  },

I load it in first, and it loads fine in roughly 1-2 hours. My movie.json format looks like this:

  {
    "pk": 1,
    "model": "data.Film",
    "fields": {
      "date": "2005-08-01",
      "rating": 8.3,
      "actors": [
        [
          "Anna-Varney"
        ]
      ],
      "name": "Like a Corpse Standing in Desperation (2005) (V)"
    }
  },

Loading the movies fixture, however, has taken an extremely long time: it's currently 20 hours in, and my computer is sluggish while it runs. I loaded similar fixtures 2 months ago, except that I was using MySQL then (I'm on Postgres now) and I've since added the date field to my model. Loading the movies fixture into my old MySQL database only took 2-3 hours. Is there a way to determine which step the fixture loading is on, or whether it has frozen?

For reference, my models are:

class PersonManager(models.Manager):
    def get_by_natural_key(self, full):
        return self.get(full=full)


class Person(models.Model):
    objects = PersonManager()
    full = models.CharField(max_length=100, unique=True)
    short = models.CharField(max_length=100)
    num = models.CharField(max_length=5)

    def natural_key(self):
        return (self.full,)

    def __unicode__(self):
        return self.full


class Film(models.Model):
    name = models.TextField()
    date = models.DateField()
    rating = models.DecimalField(max_digits=3, decimal_places=1)
    actors = models.ManyToManyField('Person')

    def __unicode__(self):
        return self.name
dl8
  • The first thing I'd try is creating an index on `full`. The natural key handling has to find the correct `Person` instance for each film, so it's repeatedly searching on `full`. – Peter DeGlopper Oct 10 '13 at 22:39
  • Okay, I will try that. If my app heavily uses Twitter's typeahead.js with a view that runs an `istartswith` query on `short`, should I also make `short` an index? – dl8 Oct 10 '13 at 23:52
  • I'm not sure about that - it would depend on how exactly the `istartswith` query gets sent to Postgres. – Peter DeGlopper Oct 10 '13 at 23:57

3 Answers


In most cases you can speed things up a lot by loading your dumped data programmatically and inserting it with `bulk_create`.

Example:

from collections import defaultdict
from django.core import serializers

obj_dict = defaultdict(list)
with open('my_fixtures.json') as f:
    # organize the deserialized objects by model class
    for item in serializers.deserialize('json', f):
        obj_dict[item.object.__class__].append(item.object)

# one bulk INSERT per model class instead of one INSERT per object
for cls, objs in obj_dict.items():
    cls.objects.bulk_create(objs)
jlivni
  • This definitely speeds things up, but I noticed that many-to-many relationships are not being transferred over. It seems upon deserialization, the many-to-many relation which is represented as an array becomes null. Any ideas? – kyldu Mar 25 '22 at 00:49
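
One possible workaround for the many-to-many issue raised in the comment above: Django's deserializer keeps the m2m values it parsed (with natural keys already resolved to primary keys) in each `DeserializedObject`'s `m2m_data` attribute, so they can be assigned in a second pass. A sketch, assuming fixtures that carry explicit `pk` values like the ones in the question and a Django version whose related managers have `.set()`:

from collections import defaultdict
from django.core import serializers

with open('my_fixtures.json') as f:
    # keep the DeserializedObject wrappers; they hold the m2m data
    items = list(serializers.deserialize('json', f))

obj_dict = defaultdict(list)
for item in items:
    obj_dict[item.object.__class__].append(item.object)

for cls, objs in obj_dict.items():
    cls.objects.bulk_create(objs)

# second pass: bulk_create skips many-to-many fields, so assign them
# from m2m_data, which maps field names to lists of related pks
for item in items:
    for field_name, pk_list in item.m2m_data.items():
        getattr(item.object, field_name).set(pk_list)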

If you are loading your fixtures via the command line:

python manage.py loaddata --database=MY_DB_LABEL fixtures/my_fixture.json;

or perhaps programmatically through the shell:

os.system('python manage.py loaddata --database=%s fixtures/my_fixture.json;' % MY_DB_LABEL)

Fixture loading will be SLOW. (I have not investigated why. Presumably, there are many unnecessary intermediate database saves being made.)


SOLUTION: Switch to loading your fixtures programmatically from Python, using a single transaction:

from django.db import transaction
from django.core.management import call_command

with transaction.atomic(using=MY_DB_LABEL):
    call_command('loaddata', 'fixtures/my_fixture.json', database=MY_DB_LABEL)
    call_command('loaddata', 'fixtures/my_other_fixture.json', database=MY_DB_LABEL)

The fixture loading will speed up DRAMATICALLY.


Note that the `database` and `using` parameters here are optional. If you are using a single database, they are unnecessary. But if you are using multiple databases like me, you will probably want to use them to control which database the fixture data is loaded into.

Todd Ditchendorf

Because Django runs in autocommit mode, it asks the database to make absolutely sure that every single object, as soon as it is created, is immediately saved and synced to a physical location on a drive platter. This limits the number of objects saved per second to the speed of the disk platters.

You need to use the `@transaction.atomic` decorator or the `with transaction.atomic():` context manager so that the database only has to make sure everything is saved safely once - at the end.
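
For example, a minimal sketch of the context-manager form, assuming the `Person` model from the question lives in the "data" app (the `import_people` function and its `rows` argument are just illustrative):

from django.db import transaction

from data.models import Person


def import_people(rows):
    # one commit at the end of the block instead of one
    # disk sync per object under autocommit
    with transaction.atomic():
        for row in rows:
            Person.objects.create(full=row['full'],
                                  short=row['short'],
                                  num=row['num'])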

You can read more about transactions in the Django documentation.

I'd even recommend setting `ATOMIC_REQUESTS` to `True` in the database configuration when using PostgreSQL with Django. That way every browser request is automatically served in one transaction and committed only if the resulting view runs successfully.
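
For reference, a sketch of where that setting lives; the connection details are placeholders, only the `ATOMIC_REQUESTS` line is the point:

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'myuser',
        'PASSWORD': 'secret',
        'ATOMIC_REQUESTS': True,  # one transaction per HTTP request
    }
}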

Tometzky
  • Where exactly would I apply these? The transactions documentation mainly applies them to views, but my issue right now is just trying to load the data into the database via fixtures in the first place. – dl8 Oct 11 '13 at 00:55
  • I don't know Django very well, but I don't think fixtures are meant for bulk data loading. I think they save objects just as I described - one at a time. Maybe setting ATOMIC_REQUESTS to True in the database configuration will make them run in one transaction and speed it up. – Tometzky Oct 11 '13 at 05:26
  • I added roughly the same amount of data a month ago, except I used MySQL and very slightly different models, and the fixtures loaded fine. For some reason switching to Postgres and tweaking the models a bit makes it load very slowly (or freeze while loading). And I don't think ATOMIC_REQUESTS will do anything, since that's tied to HTTP requests and I'm not dealing with any of that right now; I'm simply trying to load initial data into my database. – dl8 Oct 11 '13 at 09:42
  • If you can change Postgres parameters in `postgresql.conf`, then you can try setting the `synchronous_commit` option to `off` and restarting the database. You risk losing some transactions (by default about 1 second's worth) in case of power failure, but commits will be much faster. – Tometzky Oct 11 '13 at 12:42
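
For reference, the change described in the last comment is a one-line edit to `postgresql.conf`, followed by a database restart:

# postgresql.conf
synchronous_commit = off   # commits no longer wait for the WAL flush to disk;
                           # a crash can lose roughly the last second of commits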