T-SQL CROSS APPLY with GROUP BY

Question

I'm new to CROSS APPLY and trying to understand the mechanics of exactly how it works. Specifically, in doing some tests, I found that including a GROUP BY clause within the CROSS APPLY statement drastically improves the performance of aggregations, but this seems a bit counter-intuitive. I suppose what's confusing me is the precise order of operations.

Here's my test:

declare @cust table (CUSTID int, NAME varchar(30), MaxOrder decimal, TotalAmountSpent decimal, OrderCount int) 
declare @order table (OID int, CUSTID int, AMOUNT decimal)

insert into @cust values (01, 'Fred', 0, 0, 0)
insert into @cust values (02, 'Mary', 0, 0, 0)
insert into @cust values (03, 'Karl', 0, 0, 0)

insert into @order values (20, 01, 6.00)
insert into @order values (21, 03, 10.00)
insert into @order values (22, 03, 20.00)

update @cust
   set MaxOrder = app.MaxOrder, TotalAmountSpent = app.TotalAmountSpent, OrderCount = app.OrderCount
  from @cust c

 cross apply (
               select MAX(AMOUNT) MaxOrder, SUM(AMOUNT) TotalAmountSpent, COUNT(OID) OrderCount 
                 from @order o 
                where c.CUSTID = o.CUSTID 
                group by o.CUSTID
             ) app

select * from @cust

This yields correct results:

CUSTID  NAME    MaxOrder    TotalAmountSpent    OrderCount
1       Fred           6                   6             1
2       Mary           0                   0             0
3       Karl          20                  30             2

Commenting out the GROUP BY causes Mary's values to be written over as NULL:

CUSTID  NAME    MaxOrder    TotalAmountSpent    OrderCount
1       Fred           6                   6             1
2       Mary        NULL                NULL             0
3       Karl          20                  30             2

So while both result sets could be regarded as "correct", the first method only affects the rows that are actually relevant. Over a much larger data set, this appears to improve performance quite a bit.

Here's what I'm confused about: In general, I believe that in any SQL statement, the WHERE clause will be processed before a GROUP BY clause, no? In this case, does the SQL Server query optimizer know to execute the GROUP BY first, before applying the WHERE clause between the left and right tables? It surprised me that writing it this way led to both correct results and better performance. An explanation of exactly what's going on under the hood would be much appreciated.

Thanks!

score 3 · Answer 1 · answered Oct 09 '14 at 22:58

It's not about the where clause and the group by clause being executed asynchronously, but rather what the query optimizer sees as the most efficient path. Looking at the query plan with the group by, there are two sorts introduced, one for each table, before the stream aggregate and merge join. The sorted lists will be faster to aggregate when grouping than unsorted lists - there would be less comparison needed and less checking/IO required - just every interval on which your grouping expression changes, it sets up a new group and continues to stream your numeric values in.

On the other hand, without it, your applied query with where clause is sufficient to only return 1 row, thus not disrupting your result set, because it's all aggregate functions. Without the group by, there's no need to track changes in any expression, and just feeds the aggregates anything matching from the where clause criteria.

Are the results the same? Not quite, but doing a simple coalesce to zero is simpler than the processing cost associated with the two sorts in the other query plan.

I can't up-vote yet, but this was helpful, thank you. – SQLDM Oct 10 '14 at 14:01 — SQLDM, Oct 10 '14 at 14:01

score 2 · Answer 2 · edited May 23 '17 at 11:51

Interesting behaviour. Strictly speaking, your query is not correct - if you have no orders for Mary but still want to update her record, you should use outer apply instead of cross. Also, it would probably be better to handle this "no records" scenario with isnull() wrappers in the set part.

Right now, values in Mary's row aren't rewritten with zeroes - they are left unchanged, because apply does not return anything for her. You can see this by changing your table initialisation as follows:

insert into @cust values (01, 'Fred', -1, -1, -1)
insert into @cust values (02, 'Mary', -1, -1, -1)
insert into @cust values (03, 'Karl', -1, -1, -1)

With group by in place, Mary's row doesn't get zeroes, it still has all those -1's. It's exactly the same behaviour when you try to assign a value to a scalar variable with a query that returns no rows - the variable will still hold its previous value after that. It is a documented and well known feature.

Having said that, though, it is still very interesting (to me, at least), why commenting out group by changes the behaviour so drastically. We can narrow it down by looking at the results of the apply subquery, like this:

select MAX(AMOUNT) MaxOrder, SUM(AMOUNT) TotalAmountSpent, COUNT(OID) OrderCount
from @order o 
where o.CUSTID = 2;

select MAX(AMOUNT) MaxOrder, SUM(AMOUNT) TotalAmountSpent, COUNT(OID) OrderCount
from @order o 
where o.CUSTID = 2
group by o.CUSTID;

As it appears, specifying grouping criteria works as an additional filter. This is probably the way aggregation is implemented in SQL Server.

EDIT: After some search, I have found that Oracle works exactly the same way. So it is a standard behaviour, it seems. Also, there is a discussion of this effect here: Count Returning blank instead of 0

In short, group by filters out groups that are not present, so when you are specifying a customer with no sales, you got nothing. Without grouping, however, there is no such filtering stage, so you receive aggregates for the whole table - nulls for max and sum and zero for count. And in your particular example the group by is actually unnecessary, because all returned columns are aggregates (which is quite rare).

In my initial tests, I did have OUTER APPLY as part of the main query (no update statement), because as you mentioned, that would be the naturally correct approach. However, that was when I first noticed the performance issues with larger data sets. So, that led me to trying update statements, and the CROSS with GROUP BY. To me, it didn't seem intuitive that it would work, but I've checked results again and again, and it does appear to yield correct aggregations AND, as you described it, act as an additional filter that improves performance. (Don't have enough rep to upvote yet, but thanks!) — SQLDM, Oct 10 '14 at 13:56

T-SQL CROSS APPLY with GROUP BY

2 Answers2