
I need to upload a lot of data to a MySQL db. For most models I use django's ORM, but one of my models will have billions (!) of instances and I would like to optimize its insert operation.

I can't seem to find a way to make executemany() work, and after googling it seems there are almost no examples out there.

I'm looking for the correct sql syntax + correct command syntax + correct values data structure to support an executemany command for the following sql statement:

INSERT INTO `some_table` (`int_column1`, `float_column2`, `string_column3`, `datetime_column4`) VALUES (%d, %f, %s, %s)

Yes, I'm explicitly stating the id (int_column1) for efficiency.

A short code example would be great.

Jonathan Livni

4 Answers


Here's a solution that actually uses executemany() !

Basically the idea in the example here will work.

But note that in Django, you need to use the %s placeholder rather than the question mark.

Also, you will want to manage your transactions. I'll not get into that here as there is plenty of documentation available.

    from django.db import connection, transaction

    cursor = connection.cursor()

    query = '''INSERT INTO table_name
               (var1, var2, var3)
               VALUES (%s, %s, %s)'''

    # build_query_list() represents some function to populate
    # the list with multiple records in the tuple format
    # (value1, value2, value3).
    query_list = build_query_list()

    cursor.executemany(query, query_list)

    transaction.commit()
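
For illustration, a hypothetical `build_query_list()` might look like the sketch below. It is not part of Django; the row count, column values, and date used here are made up, and in a real application you would iterate over your actual data source:

```python
from datetime import datetime

def build_query_list():
    """Hypothetical stand-in: return a list of tuples, one per row,
    in the same order as the %s placeholders in the query."""
    rows = []
    for i in range(3):  # in practice, iterate over your real data source
        rows.append((i, i * 0.5, "label-%d" % i, datetime(2010, 11, 28)))
    return rows
```

Each tuple maps positionally onto the `%s` placeholders; the driver escapes the values itself, which is why every placeholder is `%s` regardless of column type.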

boxed
jsh
  • Half a year late, but finally an answer regarding executemany(). I haven't tested your answer, but I'm accepting it and hoping your code works – Jonathan Livni May 24 '11 at 15:26
  • I can't find documentation for a function buildQueryList() anywhere. Is it part of Django? – Trindaz Aug 18 '11 at 04:54
  • if you read the comments...it's up to YOU to build that function based on your application's specific needs / database schema... #here buildQueryList() represents some function to populate #the list with multiple records #in the tuple format (value1,value2,value3). – jsh Aug 18 '11 at 15:23
  • 1
    For future people looking for the same thing: jsh's snippet works. Thanks – Arthur Nov 23 '11 at 19:17
  • 1
    Note that if you are using MySQLdb 1.2 you'll run into a very unfortunate performance regression. The workaround is to use lowercase "values". Seriously. See http://stackoverflow.com/questions/3945642/why-is-executemany-slow-in-python-mysqldb – Paul Du Bois Mar 27 '12 at 01:57
  • +1 for a good answer, but how about some pythonic variable and function names? `query_list` instead of `queryList` and `build_query_list()` instead of `buildQueryList()`. – MrOodles May 31 '12 at 17:46

Are you seriously suggesting loading billions of rows (sorry, instances) of data via some ORM data-access layer? How long do you have?

Bulk load if possible: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
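A minimal sketch of that approach: pre-process the rows in Python, dump them to a flat file, and hand the whole file to MySQL in one statement. The table and column names below are the ones from the question; actually executing the final statement requires a live MySQL connection with local-infile enabled.

```python
import csv
import os
import tempfile

# Pre-processed rows, ready for bulk loading (values are placeholders).
rows = [(1, 0.5, "foo", "2010-11-28 00:00:00"),
        (2, 1.5, "bar", "2010-11-28 00:00:00")]

# Dump the rows to a temporary CSV file...
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# ...then load the whole file in one statement instead of per-row INSERTs.
load_sql = ("LOAD DATA LOCAL INFILE '%s' INTO TABLE some_table "
            "FIELDS TERMINATED BY ',' "
            "(int_column1, float_column2, string_column3, datetime_column4)"
            % path)
# cursor.execute(load_sql)  # needs a live connection with local-infile enabled
```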

Jon Black
  • I think you missed where I wrote I'm using ORM for most models (which have only a few hundred of instances) and using raw SQL for the single model who has billions of instances, but the link is helpful thanks – Jonathan Livni Nov 28 '10 at 19:35
  • Also, I can't really use load-data, as I do manipulate the data before I enter it to the db... – Jonathan Livni Nov 28 '10 at 19:39
  • Lastly, yes, it's that important, even if it takes a few days to insert to the db. It's a one time operation (until the next time...) – Jonathan Livni Nov 28 '10 at 19:41
  • why don't you write a Dump2CSV method which processes your instances and creates a series of .dat files which you can load into your db? – Jon Black Nov 28 '10 at 22:29
  • So what I guess you're saying is that no matter what the modification on the data is, I should separate the loading process from the data manipulation part and load using load-data – Jonathan Livni Nov 30 '10 at 11:06
  • if you have doubts, perhaps run a few benchmarks with, say, 1 million rows loaded via Django and the same using LOAD DATA INFILE, then work out how long it would take to load 1000 million rows... – Jon Black Dec 01 '10 at 10:11
  • I ran some benchmarks - separating the pre-processing and the loading almost doubled the loading time as most of the time was spent on hard drive access – Jonathan Livni May 24 '11 at 15:24

If you need to modify the data, bulk load it with LOAD DATA into a temporary table as-is, then apply the modifications with an INSERT INTO ... SELECT statement. In my experience, this is by far the fastest way to get a lot of data into a table.
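A sketch of that pattern as a sequence of statements run through a raw cursor. The staging-table name, file path, and the example transformations (doubling a column, upper-casing a string) are invented for illustration:

```python
# Each statement would be run through a raw cursor, in order.
statements = [
    # 1. Stage the file into a throwaway copy of the target table, unmodified.
    "CREATE TEMPORARY TABLE staging LIKE some_table",
    "LOAD DATA LOCAL INFILE '/tmp/rows.csv' INTO TABLE staging "
    "FIELDS TERMINATED BY ','",
    # 2. Apply the modifications in SQL while copying into the real table.
    "INSERT INTO some_table (int_column1, float_column2, "
    "string_column3, datetime_column4) "
    "SELECT int_column1, float_column2 * 2, UPPER(string_column3), "
    "datetime_column4 FROM staging",
    # 3. Clean up the staging table.
    "DROP TEMPORARY TABLE staging",
]
# for sql in statements:
#     cursor.execute(sql)
```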

nate c

I'm not sure how to use the executemany() command, but you can use a single SQL INSERT statement to insert multiple records by listing several VALUES tuples.
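A sketch of building that multi-row form in Python, using the table from the question (the row values are placeholders; the values are still passed separately so the driver does the escaping):

```python
rows = [(1, 0.5, "a", "2010-11-28 00:00:00"),
        (2, 1.5, "b", "2010-11-28 00:00:00")]

# One placeholder group per row, joined into a single VALUES clause.
placeholders = ", ".join(["(%s, %s, %s, %s)"] * len(rows))
sql = ("INSERT INTO some_table "
       "(int_column1, float_column2, string_column3, datetime_column4) "
       "VALUES " + placeholders)
flat_values = [v for row in rows for v in row]
# cursor.execute(sql, flat_values)  # one round trip for all rows
```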

Jonathan Livni