0

I want to group my model objects by three fields, and delete all objects but the youngest for each group.

My model:

class DataFile(models.Model):
    filename = models.CharField(unique=True, max_length=256)
    timestamp = models.DateTimeField()
    tile_label = models.CharField(max_length=256, blank=True)
    region = models.CharField(max_length=256, blank=True)
    resolution = models.CharField(max_length=256, blank=True)

This query gives me the list of the values I want to keep:

DataFile.objects.values('resolution', 'region', 'tile_label').annotate(maxfield=Max('timestamp'))

The result of this is

{'resolution': '30', 'region': 'SRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 9, 11, 7, 13, tzinfo=<UTC>)}
{'resolution': '30', 'region': 'NRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 8, 16, 2, 8, tzinfo=<UTC>)}
{'resolution': '30', 'region': 'NRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 15, 23, 5, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'SRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 11, 1, 13, 46, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'NRC', 'tile_label': '10r_20d', 'maxfield': datetime.datetime(2021, 11, 5, 12, 20, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'NRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 18, 5, 54, tzinfo=<UTC>)}
{'resolution': '30', 'region': 'SRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 5, 21, 8, tzinfo=<UTC>)}
{'resolution': '300', 'region': 'SRC', 'tile_label': '100r_200d', 'maxfield': datetime.datetime(2021, 11, 18, 8, 29, tzinfo=<UTC>)}

I now need to filter all DataFile objects except the one with the highest date in each group of resolution, region, tile_label, and then delete them (i.e. keep the one with the highest date).

How do I do this? I need to do a GROUP BY with values() in order to get the maximum of each group, but this means that I'm not operating on the whole queryset any more.
I think there's a solution involving subqueries, but I never really got my head around how they work.

I need a database solution, so looping is not an option. I'd also prefer to avoid __in queries for performance reasons (but would ultimately do it if there's no other solution).

C4X
  • 87
  • 8

1 Answers1

0

This query gives me the list of the values I want to keep:

DataFile.objects.values('resolution', 'region', 'tile_label').annotate(maxfield=Max('timestamp'))

This query gives you the maximum timestamp, but it doesn't tell you which row it belongs to. You could get that in a second query, but that seems complicated to me.

I tried to think about which property a DataFile would need to have to qualify for deletion. The answer I found was that there must be a newer DataFile with the same resolution/region/tile_label. When I translate that to Django I get something like this:

newer = DataFile.objects.filter(
     resolution=OuterRef('resolution'),
     region=OuterRef('region'),
     tile_label=OuterRef('tile_label'),
     timestamp__gt=OuterRef('timestamp'),
)

DataFile.objects.filter(Exists(newer)).delete()

The Exists creates a Subquery and OuterRef can be used to reference fields from the outer query.

https://docs.djangoproject.com/en/4.1/ref/models/expressions/#filtering-on-a-subquery-or-exists-expressions

In SQL this would look someting like this (not tested):

DELETE from datafiles WHERE EXISTS (
     SELECT id from datafiles as datafiles2 WHERE (
         datafiles2.resolution = datafiles.resolution
         AND datafiles2.region = datafiles.region
         AND datafiles2.tile_label = datafiles.tile_label
         AND datafiles2.timestamp >= datafiles.timestamp
     )
)
tobib
  • 2,114
  • 4
  • 21
  • 35