
We have a fairly simple stored procedure that aggregates all the data in a large table (and then puts the results in another table). For a 5M-row table this process takes around 4 minutes, which is perfectly reasonable. But for a 13M-row table the same process takes around 60 minutes.

It looks like we are crossing some kind of threshold, and I am struggling to find a simple workaround. Manually rewriting this as several "threads", each aggregating a small portion of the table, brings the run time down to a reasonable 10-15 minutes.

Is there a way to see the actual bottleneck here?
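For reference, the whole process boils down to a single statement of roughly this shape (column names are placeholders; per the comments below, the real query groups on a 10-column key and computes ~100 aggregates into a pre-created `#AggTableName` with no indexes):

```sql
-- Hypothetical sketch of the actual statement: a plain INSERT..SELECT
-- with a GROUP BY over ~10 key columns and ~100 aggregates.
insert #AggTableName
select key1, key2,            -- ... up to 10 grouping columns
       sum(val1), avg(val2),  -- ... ~100 aggregate expressions
       count(*)
from   TableName
group by key1, key2           -- ... same 10 key columns
```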

Update: The query plans are, of course, identical for both tables and look like this:

|ROOT:EMIT Operator (VA = 3)
|
|   |INSERT Operator (VA = 2)
|   |  The update mode is direct.
|   |
|   |   |HASH VECTOR AGGREGATE Operator (VA = 1)
|   |   |  GROUP BY
|   |   |  Evaluate Grouped SUM OR AVERAGE AGGREGATE.
|   |   |  Evaluate Grouped COUNT AGGREGATE.
|   |   |  Evaluate Grouped SUM OR AVERAGE AGGREGATE.
[---//---]
|   |   |  Evaluate Grouped SUM OR AVERAGE AGGREGATE.
|   |   | Using Worktable1 for internal storage.
|   |   |  Key Count: 10
|   |   |
|   |   |   |SCAN Operator (VA = 0)
|   |   |   |  FROM TABLE
|   |   |   |  TableName
|   |   |   |  Table Scan.
|   |   |   |  Forward Scan.
|   |   |   |  Positioning at start of table.
|   |   |   |  Using I/O Size 16 Kbytes for data pages.
|   |   |   |  With MRU Buffer Replacement Strategy for data pages.
|   |
|   |  TO TABLE
|   |  #AggTableName
|   |  Using I/O Size 2 Kbytes for data pages.

Update 2:
Some statistics:

                   5M rows     13M rows
CPUTime            263,350    1,180,700 
WaitTime           577,574    1,927,399 
PhysicalReads    2,304,977   13,704,583 
LogicalReads    11,479,123   27,911,085 
PagesRead        5,550,737   19,518,030 
PhysicalWrites     131,924    5,557,143 
PagesWritten       263,640    6,103,708 

Update 3:
`set statistics io,time on` results, reformatted for easier comparison:
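The numbers below were collected by skipping the stored proc and running the single statement directly with instrumentation on, roughly like this:

```sql
set statistics io, time on
go
insert #AggTableName
select ... from TableName
go
set statistics io, time off
go
```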

                              5M rows     13M rows
+-------------------------+-----------+------------+
| #AggTableName           |           |            |
| logical reads   regular |    81 114 |    248 961 |
|                 apf     |         0 |          0 |
|                 total   |    81 114 |    248 961 |
| physical reads  regular |         0 |          2 |
|                 apf     |         0 |          0 |
|                 total   |         0 |          2 |
|                 apf IOs |         0 |          0 |
+-------------------------+-----------+------------+
| Worktable1              |           |            |
| logical reads   regular | 1 924 136 |  8 200 130 |
|                 apf     |         0 |          0 |
|                 total   | 1 924 136 |  8 200 130 |
| physical reads  regular | 1 621 916 | 11 906 846 |
|                 apf     |         0 |          0 |
|                 total   | 1 621 916 | 11 906 846 |
|                 apf IOs |         0 |          0 |
+-------------------------+-----------+------------+
| TableName               |           |            |
| logical reads   regular | 5 651 318 | 13 921 342 |
|                 apf     |        52 |         20 |
|                 total   | 5 651 370 | 13 921 362 |
| physical reads  regular |    38 207 |    345 156 |
|                 apf     |   820 646 |  1 768 064 |
|                 total   |   858 853 |  2 113 220 |
|                 apf IOs |   819 670 |  1 754 171 |
+-------------------------+-----------+------------+
| Total writes            |         0 |  5 675 198 |
| Execution time          |     4 339 |     18 678 |
| CPU time                |   211 321 |    930 657 |
| Elapsed time            |   316 396 |  2 719 108 |
+-------------------------+-----------+------------+

5M rows:

Total writes for this command: 0

Execution Time 0.
Adaptive Server cpu time: 0 ms.  Adaptive Server elapsed time: 0 ms.
Parse and Compile Time 0.
Adaptive Server cpu time: 0 ms.
Parse and Compile Time 0.
Adaptive Server cpu time: 0 ms.
Table: #AggTableName scan count 0, logical reads: (regular=81114 apf=0 total=81114), physical reads: (regular=0 apf=0 total=0), apf IOs used=0
Table: Worktable1 scan count 1, logical reads: (regular=1924136 apf=0 total=1924136), physical reads: (regular=1621916 apf=0 total=1621916), apf IOs used=0
Table: TableName scan count 1, logical reads: (regular=5651318 apf=52 total=5651370), physical reads: (regular=38207 apf=820646 total=858853), apf IOs used=819670
Total writes for this command: 0

Execution Time 4339.
Adaptive Server cpu time: 211321 ms.  Adaptive Server elapsed time: 316396 ms.

13M rows:

Total writes for this command: 0

Execution Time 0.
Adaptive Server cpu time: 0 ms.  Adaptive Server elapsed time: 0 ms.
Parse and Compile Time 1.
Adaptive Server cpu time: 50 ms.
Parse and Compile Time 0.
Adaptive Server cpu time: 0 ms.
Table: #AggTableName scan count 0, logical reads: (regular=248961 apf=0 total=248961), physical reads: (regular=2 apf=0 total=2), apf IOs used=0
Table: Worktable1 scan count 1, logical reads: (regular=8200130 apf=0 total=8200130), physical reads: (regular=11906846 apf=0 total=11906846), apf IOs used=0
Table: TableName scan count 1, logical reads: (regular=13921342 apf=20 total=13921362), physical reads: (regular=345156 apf=1768064 total=2113220), apf IOs used=1754171
Total writes for this command: 5675198

Execution Time 18678.
Adaptive Server cpu time: 930657 ms.  Adaptive Server elapsed time: 2719108 ms.
Dmitry S
  • not enough info to provide any specific recommendations; generally speaking: review the query plan of the query(s) doing the aggregations; is the proc called often and if so can/should it be using a different query plan based on the input parameters (ie, is the 'long run' query re-using a query plan that happens to be 'bad' for the current set of input parameters)? what do the performance metrics (pulled from MDA tables) look like? – markp-fuso Sep 03 '19 at 12:28
  • @markp, Query plan is the same. The stored proc is called once a day but it's not relevant here - the "good" and the "bad" tables are in different databases so they don't affect each other. Which performance metrics do you suggest looking at? – Dmitry S Sep 03 '19 at 12:35
  • 4 mins for 5 mil rows doesn't sound all that good, either; I'd recommend a review of the query plan, and review of the wait events and io stats; those 3 items should give a good picture of what's going on; there could also be an issue with the design of the proc and/or queries involved (a code review of the proc and queries) – markp-fuso Sep 03 '19 at 12:39
  • consider partitioning it, this way you should be able to increase the size substantially without seeing much impact on the performance. Also, try to tripe across multiple disks as much as possible. – access_granted Sep 03 '19 at 13:59
  • @markp, We are aggregating ~100 columns so 4 mins for 5 mil looks more or less reasonable. I've updated the question with the query plan - it's fairly simple (as the query itself is simple too). The design of the proc and other queries involved are not driving this performance difference - we can safely imagine it's just one single query – Dmitry S Sep 03 '19 at 15:35
  • @access_granted, As I said - we are aggregating the whole table, partitioning by itself should not do anything other than add overhead – Dmitry S Sep 03 '19 at 15:44
  • how is #AggTableName created? `select/into` or `create table`? if `create table`, does the table have any indexes on it while the data is being inserted? how many rows are being generated by the aggregation query? – markp-fuso Sep 03 '19 at 15:58
  • @markp, `create table`, no indexes, just a plain insert. ~50k rows for the "good" case, ~90k rows for the "bad" case – Dmitry S Sep 03 '19 at 16:05
  • assuming a single query does everything, the query plan looks ok (table scan, large IO, MRU, vector hashing); I'd want to look at wait times (eg, any blocking? time spent waiting for disk reads?) and disk IO stats (#physical IOs, #logical IOs, disk/io service times); rough guess would be most of the data is being pulled from (relatively) slow disks; might be interesting to see what the performance is like if you force a degree of parallelism on the query (assuming dataserver has been configured to support parallel query operations) – markp-fuso Sep 03 '19 at 16:52
  • @markp, Nothing exactly interesting in the wait times - 10 minutes wait for event 157 (wait for object to be returned to pool) and 25 minutes wait for event 29 (waiting for regular buffer read to complete), that's for the bad case of course. I've updated the question with the stats - the difference in **writes** is worrying me. What could be the reason for that? – Dmitry S Sep 03 '19 at 17:45
  • @Dmitry S. in re: partitioning - there are many ways you can slice it - your assumption that partitioned aggregation performs the same as non-partitioned may not be correct. Worth the try on your end. – access_granted Sep 03 '19 at 19:10
  • @markp, any thoughts? – Dmitry S Sep 04 '19 at 17:07
  • 90K writes, even if using the worktable, should not add up to 5-6 million page writes; can't tell from what's been posted if these stats actually relate to the process; I'd recommend running the proc with instrumentation turned on: `set statistics io,time on`, `exec proc ...`, and then post the results (should be relatively small volume of output if this is a single `insert/select` statement) – markp-fuso Sep 04 '19 at 18:19
  • @markp, I've updated the question with the results - skipped the stored proc altogether and just ran one single ```insert #AggTableName select ... from tableName``` – Dmitry S Sep 05 '19 at 18:27
  • @markp, 5M writes vs 0 writes - what could lead to that? – Dmitry S Sep 05 '19 at 18:38
  • @markp, 10x difference in Worktable1 physical reads is also puzzling – Dmitry S Sep 05 '19 at 18:42
  • Looks like combo of a) too few buckets for aggregate hashing which b) causes use of the #worktable but c) if tempdb is too small or HK is enabled then d) #worktable (and possibly log) pages are getting pushed to disk; see if DBA can bump up server config `max buffers per lava operator` (def: 2048, max: 65K) to increase number of hashing buckets - this should reduce usage of #worktable; tempdb tuning (increase cache size, disable HK) are a bit more involved and definitely more than can be addressed via comments – markp-fuso Sep 05 '19 at 18:52
  • keep in mind that `max buffers per lava operator` is server wide so it applies to everyone; if a larger value helps your process then you'll need to work with DBA to determine a setting/size that's good for you but also minimizes amount of memory allocated to everyone else – markp-fuso Sep 05 '19 at 18:54
  • @markp, Ah, that got me thinking. For the "good" case the whole table fits into the tempdb, for the "bad" case - it's almost twice the size of the tempdb. Could that be the main driver of the performance difference? – Dmitry S Sep 05 '19 at 20:44
  • the worktable acts as an overflow for hashing buckets; if not enough buckets you'll see a lot of reads/writes on the worktable; as the volume of overflows increases you get a (greater than linear) increase in worktable writes/reads; since worktables are stored in tempdb, you'll get a lot of physical writes as the HK flushes the worktable's 'dirty' pages to disk; the 'good' case isn't so much 'good' as it is 'less bad' than the 'bad' case – markp-fuso Sep 05 '19 at 20:51
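Following the suggestion in the comments, a sketch of checking and raising the hash-bucket limit (this setting is server-wide, so it needs DBA sign-off; the value 16384 here is only an example, not a recommendation):

```sql
-- Check the current setting (ASE default is 2048, max 65K)
sp_configure "max buffers per lava operator"
go
-- Example bump; applies to every lava operator on the server,
-- so size it together with the DBA
sp_configure "max buffers per lava operator", 16384
go
```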

0 Answers