
So I have two 200 MB JSON files. The first one takes 1.5 hours to load, and the second (which creates a bunch of many-to-many relationship models with the first) takes 24+ hours (since there are no updates via the console, I had no clue whether it was still going or had frozen, so I stopped it).

Since loaddata wasn't working that well, I wrote my own script that loaded the data while also printing what had recently been saved to the db, but I noticed that the script (along with my computer) got slower the longer it ran. So I had to stop the script, restart my computer, and resume at the section of data where I left off, which was faster than letting the script run straight through. This was a tedious process: it took roughly 18 hours, with me restarting the computer every 4 hours, to get all the data fully loaded.

I'm wondering if there is a better solution for loading large amounts of data?

EDIT: I realized there's an option to load in raw SQL, so I may try that, although I need to brush up on my SQL.

dl8

1 Answer


When you're loading large amounts of data, writing your own custom script is generally the fastest. Once you've got it loaded once, you can use your database's import/export options, which will generally be very fast (e.g., pg_dump for PostgreSQL).

When you are writing your own script, though, there are two things that will drastically speed it up:

  1. Load the data inside a transaction. By default the database is likely in autocommit mode, which causes an expensive commit after each insert. Instead, make sure you begin a transaction before you insert anything, then commit it afterwards (importantly, though, don't forget the commit; nothing sucks like spending three hours importing data, only to realize you forgot to commit it).
  2. Bypass the Django ORM and use raw INSERT statements. The ORM has some computational overhead, and bypassing it will make things faster. A sketch covering both points follows below.
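For illustration, here's a minimal sketch combining both points. It assumes PostgreSQL and a hypothetical table myapp_book with (title, isbn) columns; the table name, column names, and JSON keys are placeholders to adapt to your own schema:

```python
import json

from django.db import connection, transaction

def load_books(path):
    # Parse the JSON fixture up front.
    with open(path) as f:
        records = json.load(f)

    # One transaction around the whole load avoids a commit per INSERT,
    # and atomic() commits automatically when the block exits cleanly.
    with transaction.atomic():
        cursor = connection.cursor()
        # Raw INSERTs bypass the ORM's per-object overhead.
        cursor.executemany(
            "INSERT INTO myapp_book (title, isbn) VALUES (%s, %s)",
            [(r["title"], r["isbn"]) for r in records],
        )
    # If you manage the transaction by hand instead of using atomic(),
    # remember the explicit COMMIT at the end.
```

On PostgreSQL, a bulk COPY (exposed through psycopg2's cursor.copy_from) is usually even faster than executemany if you need to squeeze out more speed.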
David Wolever
    Well, I've tried loading in data with manual commits and transactions and it's still rather slow. Guess I'm going to have to go the raw sql route. – dl8 Nov 03 '13 at 23:23