4

I have this model.

class Item(models.Model):
    name=models.CharField(max_length=128)

An Item gets transferred several times. A transfer can be successful or not.

class TransferLog(models.Model):
    item=models.ForeignKey(Item)
    timestamp=models.DateTimeField()
    success=models.BooleanField(default=False)

How can I query for all Items which latest TransferLog was successful?

With "latest" I mean ordered by timestamp.

TransferLog Table

Here is a data sample. Here item1 should not be included, since the last transfer was not successful:

ID|item_id|timestamp           |success
---------------------------------------
1 | item1 |2014-11-15 12:00:00 | False
2 | item1 |2014-11-15 14:00:00 | True
3 | item1 |2014-11-15 16:00:00 | False

I know how to solve this with a loop in python, but I would like to do the query in the database.

hynekcer
  • 14,942
  • 6
  • 61
  • 99
guettli
  • 25,042
  • 81
  • 346
  • 663

4 Answers4

4

An efficient trick is possible if timestamps in the log are increasing, that is the end of transfer is logged as timestamp (not the start of transfer) or if you can expect ar least that the older transfer has ended before a newer one started. Than you can use the TransferLog object with the highest id instead of with the highest timestamp.

from django.db.models import Max
qs = TransferLog.objects.filter(id__in=TransferLog.objects.values('item')
           .annotate(max_id=Max('id')).values('max_id'), success=True)

It makes groups by item_id in the subquery and sends the highest id for every group to the main query, where it is filtered by success of the latest row in the group. You can see that it is compiled to the optimal possible one query directly by Django.

Verified how compiled to SQL:   print(qs.query.get_compiler('default').as_sql())

SELECT L.id, L.item_id, L.timestamp, L.success FROM app_transferlog L
  WHERE L.success = true AND L.id IN
   ( SELECT MAX(U0.id) AS max_id FROM app_transferlog U0 GROUP BY U0.item_id )

(I edited the example result compiled SQL for better readability by replacing many "app_transferlog"."field" by a short alias L.field, by substituting the True parameter directly into SQL and by editing whitespace and parentheses.)


It can be improved by adding some example filter and by selecting the related Item in the same query:

kwargs = {}  # e.g. filter: kwargs = {'timestamp__gte': ..., 'timestamp__lt':...}
qs = TransferLog.objects.filter(
        id__in=TransferLog.objects.filter(**kwargs).values('item')
                               .annotate(max_id=Max('id')).values('max_id'),
        success=True).select_related('item')

Verified how compiled to SQL:   print(qs.query.get_compiler('default').as_sql()[0])

SELECT L.id, L.item_id, L.timestamp, L.success, I.id, I.name
FROM app_transferlog L INNER JOIN app_item I ON ( L.item_id = I.id ) 
  WHERE L.success = %s AND L.id IN
   ( SELECT MAX(U0.id) AS max_id FROM app_transferlog U0
     WHERE U0.timestamp >= %s AND U0.timestamp < %s
     GROUP BY U0.item_id )

print(qs.query.get_compiler('default').as_sql()[1])
# result
(True, <timestamp_start>, <timestamp_end>)

Useful fields of latest TransferLog and the related Items are acquired by one query:

for logitem in qs:
    item = logitem.item  # the item is still cached in the logitem
    ...

The query can be more optimized according to circumstances, e.g. if you are not interested in the timestamp any more and you work with big data...

Without assumption of increasing timestamps it is really more complicated than a plain Django ORM. My solutions can be found here.

EDIT after it has been accepted:

An exact solution for a non increasing dataset is possible by two queries:

  • Get a set of id of the last failed transfers. (Used fail list, because it is much smaller small than the list of successful tranfers.)
  • Iterate over the list of all last transfers. Exclude items found in the list of failed transfers.

This way can be be efficiently simulated queries that would otherwise require a custom SQL:

SELECT a_boolean_field_or_expression,
       rank() OVER (PARTITION BY parent_id ORDER BY the_maximized_field DESC)
FROM ...
WHERE rank = 1 GROUP BY parent_object_id

I'm thinking about implementing an aggregation function (e.g. Rank(maximized_field) ) as an extension for Django with PostgresQL, how it would be useful.

Community
  • 1
  • 1
hynekcer
  • 14,942
  • 6
  • 61
  • 99
2

try this

# your query    
items_with_good_translogs = Item.objects.filter(id__in=
                      (x.item.id for x in TransferLog.objects.filter(success=True)) 

since you said "How can I query for all Items which latest TransferLog was successful?", it is logically easy to follow if you start the query with Item model.

I used the Q Object which can be useful in places like this. (negation, or, ...)

(x.item.id for x in TransferLog.objects.filter(success=True)

gives a list of TransferLogs where success=True is true.

gunr2171
  • 16,104
  • 25
  • 61
  • 88
doniyor
  • 36,596
  • 57
  • 175
  • 260
  • @doniyor thank you for your answer. But I think your answer does not fit to the question. The `timestamp` column is never empty in my model. – guettli Nov 15 '14 at 14:19
  • @doniyor I updated the question: I added sample data – guettli Nov 15 '14 at 14:24
  • @doniyor Sorry, your current answer contain all items which had at least one successful transfer. But I need to check the **latest** transfer. Not any or all. BTW, I think your solution could be done like this, too: Item.objects.filter(tranferlog_set__success=True) – guettli Nov 15 '14 at 19:18
0

You will probably have an easier time approaching this from the ItemLog thusly:

dataset = ItemLog.objects.order_by('item','-timestamp').distinct('item')

Sadly that does not weed out the False items and I can't find a way to apply the filter AFTER the distinct. You can however filter it after the fact with python listcomprehension:

dataset = [d.item for d in dataset if d.success]

If you are doing this for logfiles within a given timeperiod it's best to filter that before ordering and distinct-ing:

dataset = ItemLog.objects.filter(
    timestamp__gt=start,
    timestamp__lt=end
        ).order_by(
    'item','-timestamp'
        ).distinct('item')
0

If you can modify your models, I actually think you'll have an easier time using ManyToMany relationship instead of explicit ForeignKey -- Django has some built-in convenience methods that will make your querying easier. Docs on ManyToMany are here. I suggest the following model:

class TransferLog(models.Model):
    item=models.ManyToManyField(Item)
    timestamp=models.DateTimeField()
    success=models.BooleanField(default=False)

Then you could do (I know, not a nice, single-line of code, but I'm trying to be explicit to be clearer):

results = []
for item in Item.objects.all():
    if item.transferlog__set.all().order_by('-timestamp')[0].success:
        results.append(item)

Then your results array will have all the items whose latest transfer was successful. I know, it's still a loop in Python...but perhaps a cleaner loop.

user
  • 4,651
  • 5
  • 32
  • 60