
I have Django code similar to this:

for obj in some_list:
    m1obj = Model1.objects.get(a=obj.a, b=obj.b, c=obj.c)
    Model2(m1=m1obj, d=obj.d, e='foo').save()

I did optimize the insert into Model2 using bulk_create; however, this is still painfully slow because of the get from Model1 (~45 sec for 3k inserts).

I also tried adding:

class Meta:
    index_together = [
        ('a', 'b', 'c'),
    ]
    unique_together = [
        ('a', 'b', 'c'),
    ]

The unique_together helps a little, index_together didn't seem to have much effect.

I have a cumbersome workaround for this doing:

  1. Filter Model1 to get all the objects I will need, ordered by one or more keys, e.g. order_by('a', 'b'), and make sure Django caches the result, e.g. with len()
  2. Use binary search (from bisect import bisect_left) to locate the first a, then b, etc. (although there are far fewer bs and cs, so just iterating over those performs about the same).

This reduces the insert time to just over 3 seconds!

There must be a better, cleaner, and more maintainable way to do this. Any suggestions? Is there a way to filter/get (smartly) within Django's cached query results?

EDIT: Changed d='foo' to d=obj.d - any bulk get needs to be mappable to the tuple it belongs to, otherwise I cannot create the Model2 entry.

mibm

2 Answers


You can make a single query (as described here) that will fetch only the results you need, so there is no need to sort and binary search later.

I have not tested it, so I don't know whether it will be faster than what you are already doing. Also, since the SQL query will be large (it grows with the number of records in some_list), it might raise an error if it exceeds the size set in MySQL by the max_allowed_packet parameter (16MB by default, as mentioned here).

import operator
from functools import reduce  # reduce is not a builtin in Python 3
from django.db.models import Q

query = reduce(operator.or_, (Q(a=obj.a, b=obj.b, c=obj.c) for obj in some_list))
model1_objs = Model1.objects.filter(query)

Then you can do bulk_create with Model2.

Model2.objects.bulk_create([
    Model2(m1=m1, d='foo', e='bar')
    for m1 in model1_objs
])
Muhammad Tahir
  • Unfortunately, your suggestion is much slower than doing the `get()` sequentially. I stopped it after 3+ minutes. Also, as I mentioned in my edit, this could (should?) have worked if parameters 'd' and 'e' were constant. Since they are not, I have no way to map `model1_objs` to the correct `obj.d` as DB queries do not guarantee an order. – mibm Mar 28 '16 at 12:13

How many rows does Model1 have? If it is relatively small (less than 50k), you can fetch everything with filter and then compare the tuples in Python.

How about "some_list" - is it small (fewer than 100 items)? If so, you can use the Q object to filter everything at once.

first = some_list.pop()
conditions = Q(a=first.a, b=first.b, c=first.c)
for obj in some_list:
    conditions |= Q(a=obj.a, b=obj.b, c=obj.c)

Model1.objects.filter(conditions)   # this will get you all the Model1 rows from your list

Q object Ref: https://docs.djangoproject.com/en/1.9/ref/models/querysets/#q-objects

Du D.
  • This solution is the same as what @muhammad-tahir suggested. For the 3K rows I killed the query after a few minutes. For smaller groups it isn't much faster (if at all) than separate queries. I did improve the workaround to create a tuple-to-table-row map, which is both fast and readable, but if there were 100K rows instead of 3K I might run into memory problems... – mibm Mar 29 '16 at 12:06