0

I have a bunch of object in a response object, which I save into the database. Doing it object by object is very slow since that essentially means if its 30k objects, its gonna make 30k commits to the database afaik.

Example 1:

for obj in response['RESULTS']:

    _city = City.objects.create(
        id=obj['id'],
        name=obj['name'],
        shortname=obj['shortname'],
        location=obj['location'],
        region=region_fk
    )

    _events = Event.objects.get(pk=obj['Event'])
    _events.city_set.add(_city)

My new approach to implement bulk_create() is something like this:

Example 2:

bulk_list = []

for obj in response['RESULTS']:

    # get the foreignkey instead of duplicating data

    if obj.get('Region'):
        region_fk = Region.objects.get(pk=obj['Region'])      

    bulk_list.append(
        City(
            id=obj['id'],
            name=obj['name'],
            shortname=obj['shortname'],
            location=obj['location'],
            region=region_fk
        )
    )

bulk_save = City.objects.bulk_create(bulk_list)

While this is alot faster than my previous attempt it has a problem, now I dont know how to add my M2M relationships.

models.py

class City(models.Model):

    id = models.CharField(primary_key=True, max_length=64)
    name = models.CharField(max_length=32)
    shortname = models.CharField(max_length=32)
    location = models.CharField(max_length=32)
    region = models.ForeignKey(max_length=32)
    events = models.ManyToManyField(Event)


class Event(models.Model):

    id = models.CharField(primary_key=True, max_length=64)
    description = models.TextField()
    date = models.DateTimeField()

class Region(models.Model):

    id = models.IntegerField(primary_key=True)

Questions

I have looked around on stackoverflow and have found some examples but I dont understand them completely.. it seems like most answers talks about bulk_create the M2M relations aswell via a through model, and im not sure thats what I am looking for.

  1. How can I add those M2M relations?
  2. Please break it down so I can understand, I want to learn :-)

Any help or pointers are highly appreciated. Thank you.

Other information

I run:

  • postgresql
  • django==1.11

Related posts

Django docs on this topic

Response example:

"RESULT": [
  {
    "City": [
      {
        "id": "349bc6ab-1c82-46b9-889e-2cc534d5717e",
        "name": "Stockholm",
        "shortname": "Sthlm",
        "location": "Sweden",
        "region": [
          2
        ],
        "events": [
          {
            "id": "989b6563-97d2-4b7d-83a2-03c9cc774c21",
            "description": "some text",
            "date": "2017-06-19T00:00:00"
          },
          {
            "id": "70613514-e569-4af4-b770-a7bc9037ddc2",
            "description": "some text",
            "date": "2017-06-20T00:00:00"
          },
            {
            "id": "7533c16b-3b3a-4b81-9d1b-af528ec6e52b",
            "description": "some text",
            "date": "2017-06-22T00:00:00"
          },
      }
  }
]
Niclas
  • 510
  • 6
  • 19

1 Answers1

0

It depends.

If you have M2M relationship without explicit through model, then possible solution with Django ORM would be:

from itertools import groupby

# Create all ``City`` objects (like you did in your second example):
cities = City.objects.bulk_create(
    [
        City(
            id=obj['id'],
            name=obj['name'],
            shortname=['shortname'],
            location=['location'],
            region=['region']
        ) for obj in response['RESULTS']
    ]
)

# Select all related ``Event`` objects.
events = Event.objects.in_bulk([obj['Event'] for obj in response['RESULTS']])

# Add all related cities to corresponding events:
for event_id, event_cities_raw in groupby(response['RESULTS'], lambda x: x['Event']):
    event = events[event_id]
    # To avoid DB queries we can gather all cities ids from response
    city_ids = [city['id'] for city in event_cities_raw]
    # And get saved objects from bulk_create result, which are required for ``add`` method.
    event_cities = [city for city in cities if city.pk in city_ids]
    event.city_set.add(*event_cities)

1 bulk_create query, 1 in_bulk query + 1 query for each unique Event in response (event.city_set.add does single UPDATE query by default).

With explicit through model it should be possible to use another bulk_create for this model, in other words replace all event.city_set.add queries with single ExplicitThrough.objects.bulk_create.

Possibly, you'll need to handle situation when Event from response['RESULTS'] doesn't exist, then you'll have to create these objects with another bulk_create.

Addressing your comment:

If there will be some events in response['RESULTS'] which don't exist in the DB. In this case you can do another bulk_create just under Event.objects.in_bulk query:

new_events = Event.objects.create_bulk([obj['Event'] for obj in response['RESULTS'] if obj['Event']['id'] not in events])

But here, it depends on the object structure in response['RESULTS']. But in general you need to create missing events here. It should be faster than using Event.objects.get_or_create calls.

Stranger6667
  • 418
  • 3
  • 17
  • Thank you for a interesting and explanatory reply, this seems to work - but you are right about that I need to handle the situation when Event does not exist in the reply and in the database either. Since I am quite new to python and django, where would this check need to happen? edit: sent reply before I was done writing the first one, fixed – Niclas Jun 19 '17 at 14:14
  • I updated the answer. Could you, please, specify how single element from ``response['RESULTS']`` looks like? – Stranger6667 Jun 19 '17 at 14:51
  • Hello, I have updated with a response example in json in original post. In the code you provided you say `event = events[event_id]` and then `event.city_set.add(*event_cities)` - doesnt `event` need to be an instance of a event? like `event = Event.objects.get(event_id=obj['id'])`? and then `event.city_set.add(*event_cities)`? Sorry if im confusing you with my own confusion :) – Niclas Jun 19 '17 at 15:24
  • [..] further thinking about this, what if the Event also had a foreignkey field to some other model? What would happen then? Currently I am a little mind boggled about this, it seems I need something that ties everything together to fully understand what is going on here. – Niclas Jun 19 '17 at 16:49
  • ``in_bulk`` returns a dictionary like this {: }, so this code should work :) – Stranger6667 Jun 19 '17 at 17:01
  • For foreign keys in Event everything could be like this - create / select all related objects (in the same way as existing / not existing events), add them to Event objects dictionaries and create events as above – Stranger6667 Jun 19 '17 at 17:19
  • Alright I think I understand, I am gonna experiment some more with this and reply back once I have a somewhat working solution. In the meantime, thank you for taking the time to explain :) – Niclas Jun 19 '17 at 17:42
  • Hi again, I have been experimenting some now, and it seems to work but I need to do more testing. Another maybe stupid question but in your example you are using list comprehension `[...] for obj in response['RESULT']` - but where do I put the validation before adding a object to the list? I have some default values that needs to be put into the list if `None` are in the response.. – Niclas Jun 19 '17 at 18:29
  • If you want to add some kind of validation, then you could use ``if``: ``[i for i in response['RESULTS'] if i['description'] is not None]`` - in this code resulting list will contain only objects, which have not None description value. If you need to put some default values, you have can use dict's ``.get()`` method - ``[{'id': obj['id'], 'description': obj.get('description', 'DEFAULT_DESCRIPTION')} for obj in response['RESULT']]`` – Stranger6667 Jun 19 '17 at 18:58
  • I have updated `Example 2` and `models.py` in original post - maybe this reflects what I am trying to do more efficiently. I imagine doing alot of those checks down the road since the data tend to be inconsistent. So basically I want to do that type of checking in `Example 2`, but with list comprehension the way you showed in your example since that should be even faster than appending to list. – Niclas Jun 19 '17 at 19:25