
Thanks for reading my question! I use Django 3.0.8 with Postgres 12. Below is a simplified model for Inventory; there are about 1M records.

class Inventory(models.Model):
    account = models.ForeignKey(Account, on_delete=models.PROTECT)
    item_id = models.BigIntegerField(unique=True, db_index=True)
    amount = models.IntegerField()

Every hour we receive a new snapshot of one Account (e.g. acc_0) through a REST API. It contains a full list of items (~100k records), not deltas. I want to apply 3 actions:

  1. Set amount=0 where account==acc_0 if item_id not in the new snapshots.
  2. Set amount=snapshot['amount'] where account==acc_0 if item_id in the new snapshots.
  3. Create new items if snapshot['item_id'] does not yet exist in Inventory.

Each time, most items already exist in the DB and their amount is unchanged, i.e. the real delta is quite small (100-1k records).

What I'm doing now doesn't seem very efficient:

with transaction.atomic():
    new_items = {}
    update_items = {}
    # Zero everything for this account; correct amounts are restored below.
    Inventory.objects.filter(account=acc_0).update(amount=0)
    for snapshot in snapshots:
        item_id = snapshot['item_id']
        results = Inventory.objects.filter(item_id=item_id)
        if len(results) == 0:
            new_items[item_id] = Inventory(
                account=acc_0, item_id=item_id, amount=snapshot['amount'])
        else:
            item = results[0]
            item.amount = snapshot['amount']
            update_items[item_id] = item
    Inventory.objects.bulk_create(new_items.values())
    Inventory.objects.bulk_update(update_items.values(), ['amount'])

I was wondering: should I upload the snapshot to a temporary table and use UPDATE ... SET ... JOIN and INSERT INTO ... SELECT ... WHERE NOT EXISTS, or, even better, is there a more Pythonic way?
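
For illustration, here is a rough sketch of what I imagine the temporary-table route would look like. The table name myapp_inventory, the account_id column, and the raw SQL are my assumptions based on Django's default naming; it is untested code, not a working solution.

from django.db import connection, transaction

def apply_snapshot(acc_0, snapshots):
    rows = [(s['item_id'], s['amount']) for s in snapshots]
    with transaction.atomic(), connection.cursor() as cur:
        # Stage the snapshot in a temp table that disappears at commit.
        cur.execute("""
            CREATE TEMP TABLE snapshot_tmp (
                item_id bigint PRIMARY KEY,
                amount  integer NOT NULL
            ) ON COMMIT DROP
        """)
        cur.executemany(
            "INSERT INTO snapshot_tmp (item_id, amount) VALUES (%s, %s)", rows)
        # 1. Zero out this account's items that are missing from the snapshot.
        cur.execute("""
            UPDATE myapp_inventory i
               SET amount = 0
             WHERE i.account_id = %s
               AND NOT EXISTS (SELECT 1 FROM snapshot_tmp t
                               WHERE t.item_id = i.item_id)
        """, [acc_0.pk])
        # 2. Update items that exist and whose amount actually changed.
        cur.execute("""
            UPDATE myapp_inventory i
               SET amount = t.amount
              FROM snapshot_tmp t
             WHERE i.account_id = %s
               AND i.item_id = t.item_id
               AND i.amount IS DISTINCT FROM t.amount
        """, [acc_0.pk])
        # 3. Insert items that do not exist yet.
        cur.execute("""
            INSERT INTO myapp_inventory (account_id, item_id, amount)
            SELECT %s, t.item_id, t.amount
              FROM snapshot_tmp t
             WHERE NOT EXISTS (SELECT 1 FROM myapp_inventory i
                               WHERE i.item_id = t.item_id)
        """, [acc_0.pk])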

There is one similar question, Django Mass Update/Insert Performance, but it is still open as well.

anch2150

1 Answer


As you have three sub-questions, I will answer in three separate parts. I'm no expert on Postgres, but I can tell you the most convenient way of solving your problem with Django.

I won't say the most efficient, as I haven't handled datasets this large, but since these are all default Django functions I expect them to perform fairly well.

1 and 2: I will assume that your account's objects are already sorted by their id. If they are sorted, a binary search will do the trick (roughly log(N) lookups instead of the N a linear search needs). Once you find the item, do whatever you need to. If they are not sorted, don't bother and just go for a standard linear search.
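
A minimal sketch of that idea with Python's bisect module, reusing the acc_0 / snapshots / new_items / update_items names from the question (a starting point under those assumptions, not tested code):

from bisect import bisect_left

# Fetch this account's rows once, sorted by item_id, instead of one query per item.
existing = list(Inventory.objects.filter(account=acc_0).order_by('item_id'))
existing_ids = [obj.item_id for obj in existing]

def find_existing(item_id):
    """Binary search existing_ids; return the matching Inventory object or None."""
    i = bisect_left(existing_ids, item_id)
    if i < len(existing_ids) and existing_ids[i] == item_id:
        return existing[i]
    return None

for snapshot in snapshots:
    obj = find_existing(snapshot['item_id'])
    if obj is None:
        new_items[snapshot['item_id']] = Inventory(
            account=acc_0, item_id=snapshot['item_id'], amount=snapshot['amount'])
    elif obj.amount != snapshot['amount']:
        obj.amount = snapshot['amount']
        update_items[obj.item_id] = obj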

3: You either get the object, or you create one. Django's get_or_create is exactly what you need.
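
A minimal sketch, reusing the snapshot and acc_0 names from the question:

obj, created = Inventory.objects.get_or_create(
    item_id=snapshot['item_id'],
    defaults={'account': acc_0, 'amount': snapshot['amount']},
)
if not created and obj.amount != snapshot['amount']:
    # Row already existed; only write if the amount actually changed.
    obj.amount = snapshot['amount']
    obj.save(update_fields=['amount'])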

crimsonpython24
  • Yes, `account` is indexed. `Inventory.objects.filter().update()` and `bulk_*()` are not so bad. The slowest part seems to be: `for snapshot in snapshots: results = Inventory.objects.filter(item_id=item_id)` – anch2150 Jul 19 '20 at 13:13
  • where is `snapshot` in the second line? – crimsonpython24 Jul 19 '20 at 14:34
  • It's retrieved every hour from a REST API. – anch2150 Jul 19 '20 at 21:24
  • First of all, you might want to consider using Python's list comprehension, which can speed this up, so it will look something like `results = [Inventory.objects.filter(item_id=snapshot['item_id']) for snapshot in snapshots]`. Besides, I'm not sure the if-else statement should belong inside the for loop – crimsonpython24 Jul 20 '20 at 00:42