
I need to bulk insert large numbers of records quickly, while still ensuring uniqueness in the database. The new records to be inserted have already been parsed and are unique among themselves. I'm hoping there is a way to enforce uniqueness at the database level, and not in the code itself.

I'm using MySQL as the database backend. If Django supports this functionality with any other database, I am flexible about changing the backend, since this is a hard requirement.

Bulk inserts in Django don't use the save method, so how can I insert several hundred to several thousand records at a time, while still respecting unique fields and unique_together fields?


My model structures, simplified, look something like this:

class Example(models.Model):
    name = models.CharField(max_length=50)
    number = models.CharField(max_length=10)
    ...
    fk = models.ForeignKey(OtherModel)

    class Meta:
        unique_together = (('name', 'number'),)
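
The insert itself looks roughly like this (a minimal sketch; `records` and `other_obj` are hypothetical stand-ins for the parsed data and the related object):

# records is a list of already-parsed, unique (name, number) pairs
objs = [Example(name=n, number=num, fk=other_obj) for n, num in records]
Example.objects.bulk_create(objs)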

Edit:

The records that aren't already in the database should be inserted, and the records that already existed should be ignored.

NickCSE
  • What form is your original data in? If it is a csv, I'm wondering if you could just loop through each entry and save it that way. – djq Mar 07 '13 at 04:09
  • [`get_or_create`](https://docs.djangoproject.com/en/1.6/ref/models/querysets/#get-or-create). – Burhan Khalid Jan 11 '14 at 05:42

2 Answers


As miki725 mentioned, you don't have a problem with your current code. I'm assuming you are using the bulk_create method. It is true that the save() method is not called when using bulk_create, but the uniqueness of fields is not enforced inside the save() method anyway. When you use unique_together, a unique constraint is added to the underlying MySQL table when the table is created:

Django:

unique_together = (('name', 'number'),)

MySQL:

UNIQUE KEY `name` (`name`,`number`)

So if you insert a duplicate value into the table using any method (save, bulk_create or even raw SQL), you will get this error from MySQL:

Duplicate entry 'value1-value2' for key 'name'

UPDATE:

What bulk_create does is build one big query that inserts all the data at once. So if one of the entries is a duplicate, an exception is thrown and none of the data is inserted.

1- One option is to use the batch_size parameter of bulk_create and insert the data in a number of batches, so that if one batch fails you only lose the rest of that batch. (How acceptable that is depends on how important it is to insert all the data and how frequent the duplicate entries are; a sketch combining this with option 2 follows the list.)

2- Another option is to loop over the bulk data and insert it one row at a time. That way the exception is thrown for the duplicate row only and the rest of the data is inserted, but it queries the database once per record and is of course a lot slower.

3- A third option is to lift the unique constraint, insert the data using bulk_create, and then write a simple query that deletes the duplicate rows.
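
Here is a minimal sketch combining options 1 and 2, assuming the Example model from the question; the function name and batch size are made up, and transaction.atomic() assumes Django 1.6 or later. Each batch is attempted with bulk_create, and only a failed batch falls back to row-by-row saves, so duplicates are skipped without losing the rest of the data:

from django.db import IntegrityError, transaction

def insert_skipping_duplicates(instances, batch_size=500):
    # instances is a list of unsaved Example objects
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        try:
            # atomic() gives a clean rollback if the whole batch fails
            with transaction.atomic():
                Example.objects.bulk_create(batch)
        except IntegrityError:
            # At least one duplicate in this batch: retry row by row,
            # skipping only the rows that violate the unique constraint.
            for obj in batch:
                try:
                    with transaction.atomic():
                        obj.save()
                except IntegrityError:
                    pass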

jurgenreza
  • Will this still insert the rest of the records? – NickCSE Mar 07 '13 at 14:45
  • Unfortunately, I can't miss any of the data; each record must make its way into the database. I'm currently doing something like #2, and it's slow enough to have prompted this question. I'd given thought to #3, but that doesn't seem at all scalable. Each bulk insert can be tens of thousands of records. – NickCSE Mar 07 '13 at 19:56
  • @NickCSE I would go with #3. It does not have any scalability problems; bulk_create is scalable, of course, and the query to delete duplicate rows can also be written efficiently (don't use Django for that). – jurgenreza Mar 07 '13 at 20:23
  • @NickCSE So is your problem solved? If you have trouble writing the remove-duplicates query, you can ask about it in a new question. – jurgenreza Mar 08 '13 at 20:59
  • Not really, I still feel that this solution isn't scalable. I'm going to do some testing with variants of temporary database setups like this to see what works. – NickCSE Mar 09 '13 at 04:04

Django itself does not enforce the unique_together meta attribute; it is enforced by the database using a UNIQUE constraint. You can insert as much data as you want and you are guaranteed that the specified fields will be unique. If they are not, an IntegrityError is raised. More about unique_together in the docs.
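
For example (a minimal sketch, assuming the Example model from the question; `other_obj` is a hypothetical related instance), inserting a duplicate ('name', 'number') pair raises the error:

from django.db import IntegrityError

Example.objects.create(name='foo', number='1', fk=other_obj)
try:
    Example.objects.create(name='foo', number='1', fk=other_obj)
except IntegrityError:
    # the ('name', 'number') pair already exists in the table
    pass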

miki725