
A query that selects multiple aggregates from a couple million rows shows mysteriously poor performance, orders of magnitude slower than the same query in Postgres, and orders of magnitude slower than calculating each aggregate in its own separate SQL query.

STATISTICS IO reveals an inexplicably high count of logical reads from 'Worktable': something like 157 million logical reads when processing 50 million rows (92917 pages read from the source table). The behaviour is exactly the same in all SQL Server versions and editions that I have access to, from SQL Server 2008 R2 through SQL Server 2019.

The simplest query exhibiting this behaviour that I could find looks like this:

select count(*), count(distinct m13), count(distinct m23) from foo

m13 and m23 are low-cardinality foreign keys, simulated in the following repro script by taking the record number modulo 13 and modulo 23 (the script is currently configured for processing 50% of 10,000,000 rows):

-- note: modify the cross join in the CTE for X in order to achieve the desired table size
if object_id('tempdb..#T') is null begin
    create table #T (n int primary key clustered, m13 tinyint not null, m23 tinyint not null);
    with -- ISNULL() removes the stain of perceived nullability
        D as (select isnull(value, 0) as value from (values (0), (1), (2),(3),(4),(5),(6),(7),(8),(9)) v (value)),
        H as (select isnull(d1.value * 10 + d0.value, 0) as value from D d1 cross join D d0),
        K as (select isnull(h.value * 10 + d.value, 0) as value from H h cross join D d),
        M as (select isnull(k1.value * 1000 + k0.value, 0) as value from K k1 cross join K k0),
        -- using ROW_NUMBER() here to allow easy composition via cross join, for freely choosing the target table size
        X as (select isnull(row_number() over (order by (select null)), 0) as value from D cross join M)
    insert into #T
        select value, value % 13, value % 23 from X;
end;
    
-- dial the desired percentage of rows to skip via the factor here
declare @threshold int = (select floor(0.50 * max(n) + 1) from #T);

select compatibility_level, @@version as "@@version" from sys.databases where name = db_name();
select '@threshold = ' + cast(@threshold as varchar);

set statistics time on;
set statistics io on;

select count(*)                                           from #T where n >= @threshold;
select count(*), count(distinct m13)                      from #T where n >= @threshold;
select count(*), count(distinct m13), count(distinct m23) from #T where n >= @threshold;

-- logical reads from 'Worktable' for @threshold at 50%:
--     10094 for 1e4 (   12 pages read from #T)
--    100942 for 1e5 (   96 pages read from #T)
--   1306626 for 1e6 (  933 pages read from #T)
--  14891285 for 1e7 ( 9295 pages read from #T)
-- 157450233 for 1e8 (92917 pages read from #T)

print char(13) + '### now the same in piecemeal fashion ... ###';

declare @cnt int, @m13 int, @m23 int = (select count(distinct m23) from #T where n >= @threshold);

select @cnt = count(*), @m13 = count(distinct m13) from #T where n >= @threshold;

select @cnt, @m13, @m23;

set statistics io off;
set statistics time off;

Is there a way to goad SQL Server towards reasonable execution plans by rewriting the query, instead of issuing separate queries for individual aggregates? My goal here is to understand where I led the SQL engine down a dark path and how I can write better SQL in the future.

DarthGizka
  • I get the results back in about 2 seconds. STAT IO is reporting -> logical reads 9561, physical reads 0. Results of just the query are sub-second on Microsoft SQL Server 2019 (RTM-CU6) (KB4563110) - 15.0.4053.23 (X64) – Ross Bush Mar 23 '23 at 17:02
  • Out of interest, what compatibility level is your database in, or are you forcing the use of the old cardinality estimator (CE)? Using the *old* CE I get *terrible* performance (minutes vs seconds) and a *lot* of reads (like yourself), but on the new CE each query only gives about 9,500 reads. I also get completely different plans: [2012](https://www.brentozar.com/pastetheplan/?id=H1wHLZ9g3) vs [2019](https://www.brentozar.com/pastetheplan/?id=rJ3fU-9lh) – Thom A Mar 23 '23 at 17:03
  • Also, it's worth noting those queries were *not* run on similar hardware. The 2012 query (which took almost a minute and a half) ran on an instance with 32 threads and 256GB of RAM. The 2019 query, which took about 5 seconds, ran on a server with 16GB of RAM and 8 threads. – Thom A Mar 23 '23 at 17:07
  • @Thom A - A Dev database with compatibility level 150, Legacy Cardinality Estimator OFF, Estimator for Secondary = PRIMARY, and MAX DOP 0 – Ross Bush Mar 23 '23 at 17:09
  • Hmm, this makes me more suspicious the OP *is* using the old CE then, @RossBush. That was the only way I could replicate the behaviour. – Thom A Mar 23 '23 at 17:10
  • Yep, looks like you're using the *old* CE. I can replicate the behaviour on a compat 110 database on SQL Server 2022, but can't in a compat 160 database. – Thom A Mar 23 '23 at 17:14
  • But querying the table multiple times *is* the fastest way if you have multiple different `DISTINCT` aggregates. It needs to work out the results on each. To be honest it's somewhat unusual to want distinct counts over separate columns; normally you just want a `ROLLUP`. – Charlieface Mar 23 '23 at 17:27
  • @Ross: `Microsoft SQL Server 2019 (RTM) - 15.0.2000.5 (X64)` with compatibility level 150 on the development machine; production is SQL Server 2014 everywhere and couldn't go higher than 2016 in any case, because of the OS. I have not fiddled with cardinality estimator settings, but I think you nailed it when you pointed the finger in that direction ... – DarthGizka Mar 23 '23 at 18:36
  • @Charlieface - It looks like a batch mode hash match aggregate is capable of calculating multiple independent `COUNT(DISTINCT ...)` from a single input stream, at least for scalar aggregates. – Martin Smith Mar 24 '23 at 05:40
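
Since the comments converge on the legacy cardinality estimator as the culprit, here is a minimal sketch of how to check and switch the CE (assuming SQL Server 2016 SP1 or later for database-scoped configurations and the USE HINT syntax):

-- check the current compatibility level and CE setting
select name, compatibility_level from sys.databases where name = db_name();
select name, value from sys.database_scoped_configurations
 where name = 'LEGACY_CARDINALITY_ESTIMATION';

-- switch the database to the new CE
alter database current set compatibility_level = 150;
alter database scoped configuration set legacy_cardinality_estimation = off;

-- or force the legacy CE for a single query to reproduce the slow plan
select count(*), count(distinct m13), count(distinct m23)
  from #T where n >= @threshold
option (use hint('FORCE_LEGACY_CARDINALITY_ESTIMATION'));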

2 Answers


On 2019 I get a plan with a batch mode hash aggregate operator that calculates the multiple count distincts side by side (I had missed that this was now possible TBH).

[execution plan screenshot: a single batch mode hash aggregate computing the distinct counts side by side]

If I change the compat mode to 2014 I do get a worse plan.

[execution plan screenshot: 2014 compatibility mode, with a separate subtree per aggregate]

And it is considerably worse with the n >= @threshold predicate: the plan still has a separate subtree for each aggregate, but now loads all the rows matching the predicate into a common subexpression spool first. This is what causes the high logical worktable reads (for spools these count rows, not pages), and the plan runs serially with a considerably higher elapsed time.

[execution plan screenshot: the common subexpression spool feeding the per-aggregate subtrees]
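
As a rough sanity check against the numbers in the question: at the 1e7 size with the 50% threshold, about 5,000,000 rows qualify, and 3 × 5,000,000 = 15,000,000 is close to the observed 14,891,285 worktable reads - consistent with each of the three aggregate subtrees scanning the spooled rows once.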

Is there a way to goad SQL Server towards reasonable execution plans by rewriting the query?

Assuming you are stuck on an older version, one workaround is to first collapse down to the 299 distinct (m13, m23) pairs ("m13 and m23 are low-cardinality foreign keys") and then spool only those:

WITH T
     AS (SELECT COUNT(*) AS Count,
                m13,
                m23
         FROM   #T
         where n >= 5000001
         GROUP  BY m13,
                   m23)
SELECT SUM(Count),
       count(DISTINCT m13),
       count(DISTINCT m23)
FROM   T 

The additional time added by the subtrees aggregating the 299 spooled rows is now minimal and it doesn't really matter that the plan becomes serial to the left of the spools.

[execution plan screenshot: the pre-aggregating rewrite - the spools now hold only 299 rows]

An alternative, and potentially better, approach along similar lines is to collapse down to the 299 pairs and then use DENSE_RANK to count the distinct values (subtracting 1 if any NULLs are present, as the distinct count should not include them):

WITH T1
     AS (SELECT COUNT(*) AS Count,
                m13,
                m23,
                DENSE_RANK() OVER (ORDER BY m13) m13_rnk,
                DENSE_RANK() OVER (ORDER BY m23) m23_rnk
         FROM   #T
         where n >= 5000001
         GROUP  BY m13,
                   m23)
SELECT SUM(Count),
       m13DistinctCount = MAX(m13_rnk) - MAX(IIF(m13 IS NULL,1,0)) ,
       m23DistinctCount = MAX(m23_rnk) - MAX(IIF(m23 IS NULL,1,0))
FROM T1

The plan for this now has no spools at all.

[execution plan screenshot: the DENSE_RANK rewrite, with no spools]
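
To see why the NULL adjustment works, here is a tiny self-contained illustration (hypothetical values, not from the repro):

-- DENSE_RANK sorts NULL first, so MAX(rnk) = 3 would count NULL as a value;
-- subtracting 1 when a NULL is present leaves the 2 distinct non-NULL values
SELECT MAX(rnk) - MAX(IIF(v IS NULL, 1, 0)) AS distinct_non_null
FROM   (SELECT v,
               DENSE_RANK() OVER (ORDER BY v) AS rnk
        FROM   (VALUES (NULL), (10), (20)) AS t (v)) AS d;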

Martin Smith
  • One way to force batch mode is to create a dummy table with a clustered columnstore, then do `LEFT JOIN Dummy ON 1 = 0` anywhere in the query. The table does not appear in the final plan. – Charlieface Mar 24 '23 at 10:07
  • True - I've not got the time at the moment to see if 2014 is able to produce the plan with the hash aggregate if this trick is applied though. – Martin Smith Mar 24 '23 at 10:42
  • Thank you for your comprehensive answer. It'll take me a while to digest it fully, though. So it seems that I have to take my offending query and analyse its dozens of aggregates in order to split the query into portions that are easier to digest for the engine, and also look for bulk-reducing intermediary operations like the one you showed for a grouping that combines `m13` and `m23` (which in reality are ids for the sending software and client ids respectively; the modulo thing was only for getting the right cardinalities). – DarthGizka Mar 24 '23 at 11:03
  • Last but not least, the [rows vs pages](https://stackoverflow.com/a/5194902/73226) thing for the read counts was a good catch. I thought it was always pages. – DarthGizka Mar 24 '23 at 11:06
  • You could try the trick that @Charlieface mentioned above to get batch mode and see if you get the hash aggregate that does it all in one operator (I am unsure if this will work and what versions have this capability in the hash aggregate) - otherwise, yes, it will need some analysis and a rewrite. – Martin Smith Mar 24 '23 at 11:21
  • @DarthGizka Sometimes you can rewrite it so that you are grouping, then grouping again. For example, do you actually want the total distinct count for each column, or maybe you need to group some of your tables before joining? – Charlieface Mar 24 '23 at 11:23
  • @Martin Smith: I finally got around to implementing the trick @Charlieface mentioned (i.e. creating a dummy table named `fix.BatchModeInducer` with a columnstore index and left-joining it on a false predicate into the offending query) - and it does indeed work like a charm! Thanks to you both! – DarthGizka Mar 25 '23 at 20:35
  • __HOWEVER__, on SQL Server 2014 the trick requires the Enterprise or Developer edition. Lesser editions are left out in the cold ... – DarthGizka Mar 25 '23 at 20:43
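
As an aside to the edition caveat above, a quick way to check which edition and version an instance is running, using the standard SERVERPROPERTY function:

select serverproperty('Edition')        as edition,
       serverproperty('ProductVersion') as product_version;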

Batch-mode aggregates are able to calculate multiple DISTINCT columns without reading the data multiple times.

The easiest way to force batch mode is to create a dummy table with a clustered columnstore index:

CREATE TABLE dbo.DummyCCI(dummy int);
CREATE CLUSTERED COLUMNSTORE INDEX DummyCCI ON dbo.DummyCCI;

Then you can place the following anywhere in your query:

LEFT JOIN dbo.DummyCCI ON 1 = 0

The table does not appear anywhere in the query plan; it is completely elided because the join condition is logically false. But its presence makes the compiler consider batch-mode aggregates, which it normally chooses, as they are faster in most cases.
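
For concreteness, this is roughly how the trick would look applied to the query from the question (a sketch; it assumes the dbo.DummyCCI table created above):

SELECT COUNT(*), COUNT(DISTINCT m13), COUNT(DISTINCT m23)
FROM   #T
LEFT JOIN dbo.DummyCCI ON 1 = 0
WHERE  n >= @threshold;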

Charlieface
  • I can confirm that the trick works like a charm on SQL Server 2019 (even on Express). However, on SQL Server 2014 it doesn't seem to work even with editions that allow the creation of columnstore indexes, i.e. Enterprise and Developer. – DarthGizka Mar 25 '23 at 21:01