1

Getting into deep water with Subquery. I have a set of Carparks. Carparks have multiple Bookings. Bookings have many BarrierActivity records, which are the various coming and going events at the barriers. These are all simple FKs up the stack.

It is possible for a booking to arrive and the barrier cameras not recognise it. A member of staff will buzz them in but that means the system has failed for some reason. And that's what I'm trying to do here. Work out what percentage of my bookings got in via automated means. I know there are a number of other ways of doing this, but I'd like to do this with a single subquery-based queryset.

My aim reasonably simple. Annotate 0 or 1 to show whether or not an "entry" BarrierActivity exists for each Booking. Annotate an average of those values, per Carpark.

The first part is fine. I can do a simple Exists() between BarrierActivity and Booking and then each booking has the 0 or 1:

successful_bas = BarrierActivity.objects.order_by().filter(
    booking=OuterRef('pk'),
    activity_type=BarrierActivity.TYPE_ANPR_BOOKING,
    direction='entry'
).values('booking')

Booking.objects.order_by().annotate(
    entry_success=Exists(successful_bas)
)

And again, that works fine. But as soon as I try to scale that up another layer (so looking at Carpark instead of Booking)...

successful_bas = BarrierActivity.objects.order_by().filter(
    booking=OuterRef('pk'),
    activity_type=BarrierActivity.TYPE_ANPR_BOOKING,
    direction='entry'
).values('booking')

bookings = Booking.objects.order_by().filter(
    carpark=OuterRef('pk')
).values('carpark').annotate(
    entry_success=Exists(successful_bas)
).values('entry_success')

Carpark.objects.order_by().annotate(
    entry_hitrate=ExpressionWrapper(
        Avg(Cast(Subquery(bookings), IntegerField())) * 100,
        output_field=FloatField()
    )
)

... I get the Subquery-error-of-doom: more than one row returned by a subquery used as an expression. The bookings subquery it clearly returning too much but how do I aggregate that before it hits the outermost subquery?


I've tried many things, but here's a reorganisation of the average to within the subquery. Same error:

successful_bas = "... as before ..."

bookings = Booking.objects_standard.order_by().filter(
    carpark=OuterRef('pk')
).values('site').annotate(
    entry_success=Exists(successful_bas)
).annotate(
    entry_avg=Avg(Cast('entry_success', IntegerField()))
).values('entry_avg')

Carpark.objects.order_by().annotate(
    entry_hitrate=ExpressionWrapper(
        Subquery(bookings, output_field=FloatField()) * 100,
        output_field=FloatField()
    )
)
Oli
  • 235,628
  • 64
  • 220
  • 299
  • Subquery is returning multiple. Take the average before calling cast – Brobin Jun 07 '17 at 13:09
  • The Cast is required for the Avg to work (it's doing bool→int). I *have* tried several alternatives —to no good effect— I'll edit in the one most like what you're suggesting. – Oli Jun 07 '17 at 13:30
  • @Oli your approach looks unnecessarily overcomplicated (casting bool to int then taking the average of it?), can you write down in a plain SQL what you want to achieve, and I guess a simpler approach can be found. – Todor Jun 08 '17 at 07:09
  • @Todor, I don't think I *can* write what I want in SQL. English and Python are the best I've got. I'm trying to work out what proportion of bookings at each carpark have a particular sort of BarrierActivity. I could count the bookking-distinct barrieractivities at a carpark and avg that against a separate count of carpark bookings, but it's less accurate. I'm actually working on denormalising the entrance and exit BAs so they're linked back from each booking. This will make my analysis much easier, but I'd still like to know —for my benefit as a Django developer— how to do nested subqueries. – Oli Jun 08 '17 at 14:34
  • @Oli, can you post a complete model structure + maybe some test data, so I can reproduce this (+bonus expected result :D)? I would like to play with this example, but I'm unable to reproduce it. – Todor Jun 11 '17 at 06:48

1 Answers1

0

I was able to re-create part of this in one of my own projects and adding a distinct('<values_field_name>').order_by() to the outer subquery solved it.

The distinct call is needed to reduce the number of rows returned. The field name in the distinct call is required otherwise it tries to do the distinct call across another field. Then the order_by() call clears any ordering so that the query has the ordering on the distinct expression matching the initial expression's ordering.

This is what I'd try:

successful_bas = BarrierActivity.objects.order_by().filter(
    booking=OuterRef('pk'),
    activity_type=BarrierActivity.TYPE_ANPR_BOOKING,
    direction='entry'
).values('booking')

bookings = Booking.objects.order_by().filter(
    carpark=OuterRef('pk')
).annotate(
    entry_success=Exists(successful_bas)
).values('carpark').distinct('carpark').order_by()

Carpark.objects.order_by().annotate(
    entry_hitrate=ExpressionWrapper(
        Avg(Cast(Subquery(bookings), IntegerField())) * 100,
        output_field=FloatField()
    )
)

Note that I removed one of the values call in the outer subquery.

schillingt
  • 13,493
  • 2
  • 32
  • 34
  • A couple of problems. As is, it returns the carpark IDs. If I change that last values() call to `entry_success`, it seems to just pick the first, or something like it. It's returning zero. I've stuffed some new BarrierActivity records in in just to make sure it shouldn't return zero too. – Oli Jun 07 '17 at 22:26
  • I was afraid that your specific case was going to be too different than my test case. Sorry. Try printing out the query and reverse engineering the ORM solution. – schillingt Jun 07 '17 at 22:42