26

I have this code for populating a table.

def add_tags(count):
    print "Add tags"
    insert_list = []
    photo_pk_lower_bound = Photo.objects.all().order_by("id")[0].pk
    photo_pk_upper_bound = Photo.objects.all().order_by("-id")[0].pk
    for i in range(count):
        t = Tag( tag = 'tag' + str(i) )
        insert_list.append(t)
    Tag.objects.bulk_create(insert_list)
    for i in range(count):
        random_photo_pk = randint(photo_pk_lower_bound, photo_pk_upper_bound)
        p = Photo.objects.get( pk = random_photo_pk )
        t = Tag.objects.get( tag = 'tag' + str(i) )
        t.photos.add(p)

And this is the model:

class Tag(models.Model):
    tag = models.CharField(max_length=20,unique=True)
    photos = models.ManyToManyField(Photo)

As I understand from this answer (Django: invalid keyword argument for this function), I have to save the Tag objects first (because of the ManyToMany field) and then attach photos to them through add(). But for a large count this process takes too long. Is there a way to refactor this code to make it faster?

In general, I want to populate the Tag model with random dummy data.

EDIT 1 (model for photo)

class Photo(models.Model):
    photo = models.ImageField(upload_to="images")
    created_date = models.DateTimeField(auto_now=True)
    user = models.ForeignKey(User)

    def __unicode__(self):
       return self.photo.name
Georgy

2 Answers

51

TL;DR Use the Django auto-generated "through" model to bulk insert m2m relationships.

# Tag.photos.through is the model Django generates for the m2m link.
# It has 3 fields: id, photo, tag.
photo_tag_1 = Tag.photos.through(photo_id=1, tag_id=1)
photo_tag_2 = Tag.photos.through(photo_id=1, tag_id=2)
# The manager method is bulk_create; pass as many link objects as you need.
Tag.photos.through.objects.bulk_create([photo_tag_1, photo_tag_2])

This is the fastest way that I know of. I use it all the time to create test data, and I can generate millions of records in minutes.
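When the number of link rows runs into the millions, it can help to build and insert them in chunks rather than as one huge list. The helper below is only a sketch of that idea: `chunked` is a hypothetical name, not part of Django, and the chunk size of 5000 in the usage comment is an arbitrary assumption.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Hypothetical usage with the through model (needs a configured Django
# project; `pairs` is assumed to be an iterable of (tag_id, photo_id)):
#
#   Through = Tag.photos.through
#   links = (Through(tag_id=t, photo_id=p) for t, p in pairs)
#   for chunk in chunked(links, 5000):
#       Through.objects.bulk_create(chunk)
```

Because `links` is a generator in the usage comment, only one chunk of through objects is held in memory at a time.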

Edit from Georgy:

from random import randint, shuffle


def add_tags(count):
    Tag.objects.bulk_create([Tag(tag='tag%s' % t) for t in range(count)])

    tag_ids = list(Tag.objects.values_list('id', flat=True))
    photo_ids = Photo.objects.values_list('id', flat=True)
    tag_count = len(tag_ids)

    for photo_id in photo_ids:
        tag_to_photo_links = []
        shuffle(tag_ids)

        rand_num_tags = randint(0, tag_count)
        photo_tags = tag_ids[:rand_num_tags]

        for tag_id in photo_tags:
            # through is the model generated by django to link m2m between tag and photo
            photo_tag = Tag.photos.through(tag_id=tag_id, photo_id=photo_id)
            tag_to_photo_links.append(photo_tag)

        Tag.photos.through.objects.bulk_create(tag_to_photo_links, batch_size=7000)

I didn't create the models to test this, but the structure is there; you might have to tweak a few things to make it work. Let me know if you run into any problems.
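The random selection in the code above can also be pulled out into a plain function that only deals with ids, which makes it easy to check without a database. This is a sketch under assumptions: `random_tag_photo_pairs`, `max_tags_per_photo` and `seed` are hypothetical names introduced here, and it uses `random.sample` in place of shuffling the whole tag list for every photo.

```python
import random


def random_tag_photo_pairs(tag_ids, photo_ids, max_tags_per_photo, seed=None):
    """Return (tag_id, photo_id) pairs, picking random tags for each photo."""
    rng = random.Random(seed)
    tag_ids = list(tag_ids)  # querysets and iterators must be materialised first
    upper = min(max_tags_per_photo, len(tag_ids))
    pairs = []
    for photo_id in photo_ids:
        how_many = rng.randint(0, upper)
        # sample() picks distinct positions, so as long as tag_ids itself has
        # no duplicates, no (tag, photo) pair is repeated.
        for tag_id in rng.sample(tag_ids, how_many):
            pairs.append((tag_id, photo_id))
    return pairs
```

Each pair can then be turned into `Tag.photos.through(tag_id=..., photo_id=...)` and passed to `bulk_create` as shown in the answer.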


Du D.
  • Hi, sorry for the delayed answer. You do have the right idea to use `through`; I found the same solution myself, though this feature is short on docs. Could you recommend some? You are an advanced Python dev, compared to me at least, and I have to read some docs to understand your answer completely, though I have to admit that a simple copy & paste didn't work for me. Thanks a lot for your help! I'll try to add additional info later. – Georgy Dec 07 '15 at 08:49
  • Can you post the model definition for Photo? – Du D. Dec 07 '15 at 15:49
  • Sorry, it took me a while. The model is in EDIT section. – Georgy Dec 10 '15 at 12:20
  • I think there is a typo: it should read Tag.Photos.through, not Tag.Photo.through. Copy the code above and try again; it should be good now. – Du D. Dec 10 '15 at 20:57
  • Hi. I added an edit to your answer earlier but it was rejected. There are some mistakes in your code; would you please modify it for the record? 1) It is not `Tag.Photos.through` but `Tag.photos.through`. 2) `Photo.objects.value_list` should be `Photo.objects.values_list` (typo here). 3) You can't shuffle tag_ids this way; use `list()` to convert it. 4) You have to move the last `bulk_create()` line out of the `for` loop, otherwise the code tries to add duplicates. Thank you in advance for your time! – Georgy Dec 24 '15 at 11:19
  • I'll make the changes, except for number 4. I think it would be fine. Thanks for the fixes! – Du D. Dec 27 '15 at 15:05
  • Just an update: Tag.photos.through.bulk_insert() will lead to "SOMETHING has no attribute bulk_insert()". Instead, we should use Tag.photos.through.objects.bulk_create(). – Ebram Shehata Apr 08 '21 at 19:22
13

As shown in Du D's answer, a Django ManyToMany field is backed by an intermediate model, exposed as through, whose table contains three columns: the ID of the relation row, the ID of the object linked from and the ID of the object linked to. You can call bulk_create on the through model's manager to create ManyToMany relations in bulk.

As a quick example, you could bulk create Tag to Photo relations like this:

tag1 = Tag.objects.get(id=1)
tag2 = Tag.objects.get(id=2)
photo1 = Photo.objects.get(id=1)
photo2 = Photo.objects.get(id=2)


through_objs = [
    Tag.photos.through(
        photo_id=photo1.id,
        tag_id=tag1.id,
    ),
    Tag.photos.through(
        photo_id=photo1.id,
        tag_id=tag2.id,
    ),
    Tag.photos.through(
        photo_id=photo2.id,
        tag_id=tag2.id,
    ),
]
Tag.photos.through.objects.bulk_create(through_objs)
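To make the shape of that through table concrete, here is a rough sketch of the same three links inserted at the SQL level with the standard library's sqlite3 module. The table and column names (`tag`, `photo`, `tag_photos`) are assumptions for illustration only; the names Django actually generates depend on your app label and models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, tag TEXT UNIQUE);
    CREATE TABLE photo (id INTEGER PRIMARY KEY, name TEXT);
    -- Stand-in for the auto-generated through table: its own id plus
    -- one foreign key to each side of the relation.
    CREATE TABLE tag_photos (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        tag_id INTEGER NOT NULL REFERENCES tag (id),
        photo_id INTEGER NOT NULL REFERENCES photo (id),
        UNIQUE (tag_id, photo_id)
    );
""")
conn.executemany("INSERT INTO tag (id, tag) VALUES (?, ?)",
                 [(1, "tag0"), (2, "tag1")])
conn.executemany("INSERT INTO photo (id, name) VALUES (?, ?)",
                 [(1, "a.jpg"), (2, "b.jpg")])

# The same three relations as above, written as (tag_id, photo_id) rows
# and sent to the database in a single executemany call.
links = [(1, 1), (2, 1), (2, 2)]
conn.executemany("INSERT INTO tag_photos (tag_id, photo_id) VALUES (?, ?)", links)
conn.commit()

rows = conn.execute(
    "SELECT tag_id, photo_id FROM tag_photos ORDER BY id"
).fetchall()
print(rows)
```

The point of bulk_create on the through model is the same: the link rows go in as plain inserts into this one table, without loading or saving the Tag and Photo objects themselves.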

General solution

Here is a general solution that you can run to set up ManyToMany relations between any list of object pairs.

from typing import Iterable
from collections import namedtuple


ManyToManySpec = namedtuple(
    "ManyToManySpec", ["from_object", "to_object"]
)


def bulk_create_manytomany_relations(
    model_from,
    field_name: str,
    model_from_name: str,
    model_to_name: str,
    specs: Iterable[ManyToManySpec]
):
    through_objs = []
    for spec in specs:
        through_objs.append(
            getattr(model_from, field_name).through(
                **{
                    f"{model_from_name.lower()}_id": spec.from_object.id,
                    f"{model_to_name.lower()}_id": spec.to_object.id,
                }
            )
        )
    getattr(model_from, field_name).through.objects.bulk_create(through_objs)

Example usage

tag1 = Tag.objects.get(id=1)
tag2 = Tag.objects.get(id=2)
photo1 = Photo.objects.get(id=1)
photo2 = Photo.objects.get(id=2)

bulk_create_manytomany_relations(
    model_from=Tag,
    field_name="photos",
    model_from_name="tag",
    model_to_name="photo",
    specs=[
        ManyToManySpec(from_object=tag1, to_object=photo1),
        ManyToManySpec(from_object=tag1, to_object=photo2),
        ManyToManySpec(from_object=tag2, to_object=photo2),
    ]
)
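One thing to watch for: the auto-generated through table has a uniqueness constraint on the pair of foreign keys, so passing the same pair twice should make bulk_create fail with an integrity error. A minimal sketch of filtering duplicates beforehand is below; `unique_pairs` is a hypothetical helper that works on plain (from_id, to_id) tuples, not on the ManyToManySpec objects above.

```python
def unique_pairs(pairs):
    """Return pairs with duplicates removed, keeping first-seen order."""
    seen = set()
    result = []
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result
```

This only removes duplicates within the list you are about to insert. Pairs that already exist in the database would still conflict; newer Django versions accept `ignore_conflicts=True` in bulk_create for that case, if your database backend supports it.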
Shaun Taylor
  • Isn't the problem in the question that these objects aren't saved and therefore don't have an `id`? I can't see how this would work. Wouldn't it still raise a `ValueError`? – alias51 Sep 25 '21 at 09:54
  • The problem in the question is that they are iterating over `for i in range(count):` and then doing `t.photos.add(p)` for each item. The `bulk_create` isn't costly and that's not the problem. Admittedly, to use my solution they'd have to iterate over `Photo.objects.all()` and `Tag.objects.all()` or some `filter` version of that. I can't recall if that does one query or many... maybe they'd need to cast it to a list. – Shaun Taylor Nov 18 '21 at 17:59
  • Thanks @ShaunTaylor for this. Would you know how to do this with bulk_update? Thanks – Courvoisier Nov 14 '22 at 21:42