
I have two models, one of which has a TreeManyToManyField (a many-to-many field for django-mptt models).

from django.db import models
from mptt.fields import TreeManyToManyField

class ProductCategory(models.Model):
    pass

class Product(models.Model):
    # (other fields such as merchant and product_id omitted)
    categories = TreeManyToManyField('ProductCategory')

I'm trying to add categories to the Product objects in bulk by using the through-table, as seen in this thread or this.

I do this with a list of pairs: product_id and productcategory_id, which is also what my through-table contains. This yields the following error:

sqlite3.IntegrityError: UNIQUE constraint failed: products_product_categories.product_id, products_product_categories.productcategory_id

My code looks as follows:

def bulk_add_cats(generic_df):
    # generic_df is a pandas dataframe where "found_categories" holds the
    # productcategory_ids to add and "forhandler"/"produktid" identify the product.
    generic_df = generic_df.explode("found_categories")

    # Look up the primary key of each product and store it in "product_object".
    generic_df["product_object"] = generic_df.apply(
        lambda x: Product.objects.filter(
            merchant=x["forhandler"], product_id=x["produktid"]
        ).values('id')[0]['id'],
        axis=1,
    )

    # The actual code.
    # Here row.product_object is the product_id and row.found_categories is one
    # particular productcategory_id, so together they make one pair to add to
    # the through-table.
    through_objs = [
        Product.categories.through(
            product_id=row.product_object,
            productcategory_id=row.found_categories,
        )
        for row in generic_df.itertuples()
    ]
    Product.categories.through.objects.bulk_create(through_objs, batch_size=1000)

I have also done the following to check that there are no duplicate pairs in the set of pairs I want to add. There were none:

print(generic_df[generic_df.duplicated(subset=['product_object','found_categories'], keep=False)])

I suspect that the error happens because some of the product-to-productcategory relations already exist in the table. So maybe I should first check, for each pair, whether that is the case, and only then do the bulk_create. I would like to retain efficiency, though, and would be sad if I had to iterate through each pair to check. Is there a way to bulk-update-or-create for this type of problem?

Or what do you think? Any help is highly appreciated :-)

1 Answer


You should probably check whether those errors are caused by duplicate data. If that's the case, I think the ignore_conflicts parameter of bulk_create can solve your problem. It will save every valid entry and ignore those that conflict with an existing row.

Check the doc: https://docs.djangoproject.com/en/3.2/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
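
For example, a minimal sketch reusing the through_objs list from the question:

# Rows that would violate the UNIQUE constraint on
# (product_id, productcategory_id) are silently skipped.
Product.categories.through.objects.bulk_create(
    through_objs,
    batch_size=1000,
    ignore_conflicts=True,
)

On SQLite this is executed as INSERT OR IGNORE, so existing pairs are left untouched rather than duplicated.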

Jack
  • I saw that attribute, but I read it as meaning ignore_conflicts would add the relation anyway, so duplicates would end up in the database. But I guess not. Do you think there is an efficient way to check for duplicates between the existing relations table and the external data? – Frederik Faarup Jan 30 '22 at 16:05
  • I guess one way to do so is to minimise db queries. So if you have a table Product and some unique value or pair of values (say key1, key2), you can use filter and Q: Product.objects.filter(Q(key1=v1, key2=v2) | Q(key1=v3, key2=v4) | etc.).exists() – Jack Jan 30 '22 at 16:31
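
As an illustration of that idea, a sketch (the helper name is hypothetical) that filters out pairs already present in the through-table before calling bulk_create, narrowing with a single __in lookup instead of OR-ing Q objects:

# Hypothetical helper: given a list of (product_id, productcategory_id)
# tuples, build through-table instances only for pairs not yet stored.
def new_through_objs(pairs):
    Through = Product.categories.through
    existing = set(
        Through.objects.filter(
            product_id__in={p for p, _ in pairs}
        ).values_list("product_id", "productcategory_id")
    )
    return [
        Through(product_id=p, productcategory_id=c)
        for p, c in pairs
        if (p, c) not in existing
    ]

Product.categories.through.objects.bulk_create(
    new_through_objs(pairs), batch_size=1000
)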