
I'm using TPC-H (SF 10) on SQL Server 2014. To improve query performance I decided to partition two of the biggest tables (LINEITEM and ORDERS) on the date column, all on the same disk, because many of the TPC-H queries filter on a date range. First I used a weekly partition scheme, then a monthly one. On each table I created a clustered columnstore index. Then I executed TPC-H query 1:
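For context, a monthly partitioning setup like the one described would look roughly like this; the object names and boundary values are illustrative, not taken from my actual scripts:

```sql
-- Hypothetical monthly partition function/scheme on the ship date.
CREATE PARTITION FUNCTION pf_monthly (date)
AS RANGE RIGHT FOR VALUES ('1992-02-01', '1992-03-01',
    /* ... one boundary per month ... */ '1998-12-01');

-- All partitions on the same filegroup (same disk, as in my setup):
CREATE PARTITION SCHEME ps_monthly
AS PARTITION pf_monthly ALL TO ([PRIMARY]);

-- The clustered columnstore index is then created on the partition scheme:
CREATE CLUSTERED COLUMNSTORE INDEX cci_lineitem
    ON LINEITEM
    ON ps_monthly (L_SHIPDATE);
```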

SELECT  L_RETURNFLAG,
        L_LINESTATUS,
        SUM(L_QUANTITY)                                       AS SUM_QTY,
        SUM(L_EXTENDEDPRICE)                                  AS SUM_BASE_PRICE,
        SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT))               AS SUM_DISC_PRICE,
        SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)) AS SUM_CHARGE,
        AVG(L_QUANTITY)                                       AS AVG_QTY,
        AVG(L_EXTENDEDPRICE)                                  AS AVG_PRICE,
        AVG(L_DISCOUNT)                                       AS AVG_DISC,
        COUNT_BIG(*)                                          AS COUNT_ORDER
FROM    LINEITEM
WHERE   L_SHIPDATE <= DATEADD(dd, -94, CAST('1998-12-01' AS date))
GROUP BY L_RETURNFLAG,
        L_LINESTATUS
ORDER BY L_RETURNFLAG,
        L_LINESTATUS;

I got the following results for the query above.

  • weekly partitioning
  • Partitions accessed: 348 (1..348) of 361 total
  • (862,194 rows were not read because they are in the last partitions)
  • Logical reads: 1381
  • Lob logical reads: 109005
  • Lob physical reads: 1371
  • Lob read-ahead reads: 200554
  • Execution time: 2807 ms
  • Compile CPU: 43
  • Compile Time: 43
  • Compile Memory: 1408

  • monthly partitioning

  • Partitions accessed: 80 (1..80) of 84 total
  • (881,087 rows were not read because they are in the last partitions)
  • Logical reads: 2902
  • Lob logical reads: 617554
  • Lob physical reads: 388
  • Lob read-ahead reads: 260486
  • Execution time: 2680 ms
  • Compile CPU: 12
  • Compile Time: 12
  • Compile Memory: 872

The biggest difference between them is the number of batches used. Around 333,201 batches were used for the index scan under weekly partitioning, but just 191,275 under monthly partitioning.

I'm a bit confused about these results. I expected the first execution (weekly partitioning) to be faster than the second one because of fewer read operations. The LOB logical reads on the monthly partitioned table are significantly higher, but the execution time, compile CPU, compile time, and compile memory are all lower. So I would conclude that monthly partitioning is more efficient. The results of the other queries look almost the same. Could anyone help me understand what's happening here?

So, I just ran the tests again with MAXDOP 1. Here are my results:
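For reference, forcing a serial plan only needs a query hint on the statement; this is a shortened sketch of TPC-H Q1 rather than the full query above:

```sql
-- Shortened form of Q1; the trailing hint forces a single-threaded plan,
-- which on a columnstore index also disables batch mode in SQL Server 2014.
SELECT  L_RETURNFLAG,
        L_LINESTATUS,
        COUNT_BIG(*) AS COUNT_ORDER
FROM    LINEITEM
WHERE   L_SHIPDATE <= DATEADD(dd, -94, CAST('1998-12-01' AS date))
GROUP BY L_RETURNFLAG, L_LINESTATUS
OPTION (MAXDOP 1);
```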

Weekly partitioning

  • logical reads: 1381
  • lob logical reads: 108619
  • lob physical reads: 1362
  • lob read-ahead reads: 200664

Monthly partitioning

  • logical reads: 739
  • lob logical reads: 94901
  • lob physical reads: 402
  • lob read-ahead reads: 262598

This is the execution plan; it looks exactly the same for both executions. Here are some further details:

https://i.stack.imgur.com/293oN.png

The difference between the numbers of read operations isn't as big as before, and there are more physical reads under weekly partitioning. Furthermore, there are more logical reads under weekly partitioning. That's exactly the opposite of what I expected.

Execution plan (monthly partitioning). I created a CI first and a clustered columnstore index afterwards (with DROP_EXISTING = ON and MAXDOP 1):

https://i.stack.imgur.com/Euo8m.png

Nancy
    Can you include the actual execution plans of both queries (I don't see the image)? I'm curious about differences in the number of threads used, assuming the plans are parallel. With all partitions on the same physical storage, it could be that too many parallel scans degrade performance. Note the higher number of physical reads. You might try the same query with the `MAXDOP 1` hint. – Dan Guzman Nov 10 '15 at 12:58

1 Answer


I'll offer a guess here: the data in the columnstore index was probably already subject to segment elimination, so partition elimination did not add anything on top of that.

You could try to prevent segment elimination, either by using an undocumented trace flag or by randomizing the physical order of the rows in the table before creating the CS index (`SELECT * INTO T2 FROM T1 ORDER BY NEWID()` works in practice).
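A sketch of that randomization test, assuming a scratch copy of LINEITEM (the table and index names here are illustrative):

```sql
-- Copy the rows in random physical order, destroying the date clustering.
-- In practice the SELECT ... INTO preserves the ORDER BY order on insert.
SELECT *
INTO   LINEITEM_SHUFFLED
FROM   LINEITEM
ORDER BY NEWID();

-- Build the clustered columnstore index on the shuffled copy.
-- With random order, each segment's min/max for L_SHIPDATE spans the
-- whole date range, so segment elimination can no longer skip anything:
CREATE CLUSTERED COLUMNSTORE INDEX cci_lineitem_shuffled
    ON LINEITEM_SHUFFLED;
```

If the query is just as fast on the shuffled copy, segment elimination was not contributing much in the first place.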

Testing CS indexes with MAXDOP 1 is meaningless, because it disables batch mode, which is responsible for a lot of the speedup.

usr
  • In both cases the execution under monthly partitioning was faster. I had already thought of segment elimination, and I'm pretty sure it is happening, because I created a clustered index on the date column before I created the clustered columnstore index. I will try it the way you suggested. I just want to understand what's happening here, so this seems to be a good way to figure it out. Thanks. – Nancy Nov 11 '15 at 15:15
  • It was marginally faster; I'd say this is in the noise range. Anything could cause such small variations, such as fragmentation or slightly worse segment elimination. The weekly version might be slower because of segments that are not full. You should check that as well if you want to investigate. – usr Nov 11 '15 at 16:09
  • Unfortunately I won't get full segments under weekly partitioning (361 partitions) because the table cardinality is just ~60,000,000. Therefore the row count per segment is around 170,000, but I thought it would be better for my query performance to have small partitions with fixed boundaries. – Nancy Nov 11 '15 at 17:05
  • OK. In general, I think that 60M rows in 2.8 seconds is really slow considering this is using a columnstore index. This is surprising. Can you upload the actual execution plan of the best query you have so far? I'd like to take a look. – usr Nov 11 '15 at 17:10
  • That's the way I see it too. The execution of the power test (TPC-H benchmark) starts with a refresh statement which inserts around 60,000 rows, so there are several open row groups (delta stores). But is it possible that this has such a big impact on query performance? I'm just reloading my database and will post an execution plan tomorrow. Thanks! – Nancy Nov 11 '15 at 20:42
  • OK! Open row groups are very slow to scan. I did not think of that particular failure mode; I guess that reflects how little production experience I have with CS indexes. Ideally, you would create a suitably sorted CI and then build a CS index with DROP_EXISTING (and ideally MAXDOP 1). – usr Nov 11 '15 at 20:45
  • I just executed the query and ignored the refresh statement, so there were no open row groups. At around 2,300 ms the result is a disappointment. With a warm buffer pool the execution time is 1,100 ms, but even after executing dropcleanbuffers and freeproccache the execution time is about 1,400 ms. I have the feeling it does not work properly. I just added the execution plan of this query above. – Nancy Nov 12 '15 at 09:22
  • Hm, I don't see anything suspicious. But I'm not sure what it's supposed to look like either. Maybe you can collect some statistics about segment elimination. We expect segment elimination to take effect, right? It looks like almost all rows were returned from the CS scan, so segment elimination apparently did not do anything. Neither did partition elimination. – usr Nov 12 '15 at 12:33
  • This time I loaded the data differently to reach full segments, so there were just a few partitions which had more than one segment. In any case, this query is a bad example for segment elimination because of its seek predicate: the range for the date column goes from '1992-01-02' to '1998-12-01' and the seek predicate is L_SHIPDATE <= '1998-08-29', so the query has to scan almost the whole index (80 of 84 partitions, to be exact). I think it shouldn't have a big impact on query performance. – Nancy Nov 12 '15 at 19:23
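To back the discussion above with numbers, the segment-elimination and delta-store statistics can be pulled from the catalog views; this is a sketch using the SQL Server 2014 DMVs, with LINEITEM assumed to carry the clustered columnstore index:

```sql
-- Row group sizes and delta stores per partition: OPEN row groups are
-- delta stores (slow to scan), COMPRESSED row groups are real segments.
SELECT  rg.partition_number,
        rg.row_group_id,
        rg.state_description,
        rg.total_rows
FROM    sys.column_store_row_groups AS rg
WHERE   rg.object_id = OBJECT_ID('LINEITEM')
ORDER BY rg.partition_number, rg.row_group_id;

-- Segment-level metadata; min_data_id/max_data_id encode each segment's
-- value range and drive segment elimination. Filter on the column_id
-- of L_SHIPDATE for your table definition.
SELECT  p.partition_number,
        s.column_id,
        s.segment_id,
        s.row_count,
        s.min_data_id,
        s.max_data_id
FROM    sys.column_store_segments AS s
JOIN    sys.partitions AS p
        ON p.hobt_id = s.hobt_id
WHERE   p.object_id = OBJECT_ID('LINEITEM')
ORDER BY p.partition_number, s.column_id, s.segment_id;
```

Comparing `total_rows` against the 1,048,576-row maximum shows how far the segments are from full; under 361 weekly partitions with ~60M rows they can't get anywhere close.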