0

I have the following code to take data from a very large csv, import it into a Django model and convert it into nested categories (mptt model).

with open(path, "rt") as f:
        reader = csv.reader(f, dialect="excel")
        next(reader)
        for row in reader:
            if int(row[2]) == 0:

                obj = Category(
                    cat_id=int(row[0]),
                    cat_name=row[1],
                    cat_parent_id=int(row[2]),
                    cat_level=int(row[3]),
                    cat_link=row[4],
                )
                obj.save()
            else:
                parent_id = Category.objects.get(cat_id=int(row[2]))
                obj = Category(
                    cat_id=int(row[0]),
                    cat_name=row[1],
                    cat_parent_id=int(row[2]),
                    cat_level=int(row[3]),
                    cat_link=row[4],
                    parent=parent_id,
                )
                obj.save()

It takes over an hour to run this import. Is there a more efficient way to do this? I tried a bulk_create list comprehension, but found I need

parent_id = Category.objects.get(cat_id=int(row[2]))

to get mptt to nest correctly. Also, I think I have to save each insertion for mptt to properly update. I'm pretty sure the parent_id query is making the operation so slow as it has to query the database for every row in the CSV (over 27k rows).

The upside is this operation will only need to be done sporadically, but this dataset isn't so large that it should take over an hour to run.

Edit

Category Model

class Category(MPTTModel):
cat_id = models.IntegerField(null=False, name="cat_id")
cat_name = models.CharField(max_length=100,
                            null=False,
                            name="cat_name",
                            default="")
cat_parent_id = models.IntegerField(null=False, name="cat_parent_id")
cat_level = models.IntegerField(null=False, name="cat_level")
cat_link = models.URLField(null=True, name="cat_link")
parent = TreeForeignKey("self",
                        on_delete=models.CASCADE,
                        null=True,
                        blank=True,
                        related_name="children")

def __str__(self):
    return f"{self.cat_name}"

class MPTTMeta:
    order_insertion_by = ["cat_name"]

Edit 2

cat_parent_id = int(row[2])
obj = Category(
                cat_id=int(row[0]),
                cat_name=row[1],
                cat_parent_id=int(row[2]),
                cat_level=int(row[3]),
                cat_link=row[4],
                parent={cat_parent_id}_id,
            )
megler
  • 217
  • 1
  • 2
  • 9
  • Can you share your `Category` model? You seem to have a `cat_parent_id` and a `parent` field that both contain almost identical data? – Iain Shelvington Apr 21 '22 at 02:55
  • The cat_parent_id is from the csv and parent is required by mptt for child entries. I had been working all day just to get nesting to work. This CSV goes 9 levels deep so I haven't tried to remove anything yet just in case I needed the relationship. – megler Apr 21 '22 at 03:43
  • If a `TreeForeignKey` behaves similarly to a `ForeignKey` you should be able to assign the raw id directly using `{field_name}_id`, so you don't have to fetch the related object - `parent_id= int(row[2]),`. If it works then you reduce the number of queries significantly, does it work? – Iain Shelvington Apr 21 '22 at 03:50
  • Sorry for the ignorance, but what goes inside {field_name}? I've tried parent={cat_parent_id}_id, and parent={int(row[2])}_id and also removed the {} -- how do I use {field_name}_id ? All of those iterations error with a missing paren. see edit – megler Apr 21 '22 at 04:21
  • Try `parent_id=int(row[2])`, don't assign to the `parent` field at all. `{field_name}_id` was just an example of the format of the field/attribute used for accessing the raw value of a ForeignKey, you append `_id` to the field name – Iain Shelvington Apr 21 '22 at 04:22
  • Error: Category.DoesNotExist: Category matching query does not exist. I had that issue earlier where it won't recognize calling the row like that. – megler Apr 21 '22 at 04:24
  • Are you still checking for `if int(row[2]) == 0:` or trying to assign `0` to `parent_id`? Passing `0` to `parent_id` would almost certainly give that error – Iain Shelvington Apr 21 '22 at 04:29
  • if int(row[2]) == 0: just re-read what you said. Yes, still have the if statement. I do know for mptt, the parent has to call an instance, not just a row. That's why the extra query is in there. – megler Apr 21 '22 at 04:30
  • The only thing I can think is to disable mptt behaviour while doing the import using something like https://django-mptt.readthedocs.io/en/latest/mptt.managers.html#mptt.managers.TreeManager.disable_mptt_updates? Sorry I can't help further – Iain Shelvington Apr 21 '22 at 04:42

0 Answers0