
We have a complex query that is built dynamically depending on each customer's options. Azure Functions run these queries every night to build reporting data; we run approximately 30k of them. In isolation the queries are about as fast as I can get them, roughly 100 ms, but when the functions run in parallel on the Azure consumption plan (restricted to a maximum of 5 functions running at the same time), query performance drops off and some queries even time out at 5 minutes. Some of the queries that time out come in at under 100 ms when I test them in isolation. There are no writes, as this all reads from a read replica in Azure.

We are running Postgres 11.6 hosted on Azure, with PgBouncer on a VM. All these queries go to a read replica configured with 4 vCores (Memory Optimized tier).

What changes can we make to allow more parallel execution of these queries, or is scaling up our only option?

I would like to share the EXPLAIN ANALYZE output, but this is restricted by the business. Please let me know what information would help and I will try to provide as much as possible.
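One kind of evidence that is usually shareable is aggregated wait-event data from `pg_stat_activity`, which also shows what the queries are actually waiting on under parallel load (IO, LWLock, Lock, Client, ...). A rough sketch, assuming psycopg2 and a DSN for the read replica; the helper names and sampling loop are illustrative, not from the question:

```python
# Sketch: sample pg_stat_activity on the replica while the nightly batch runs,
# then aggregate the wait-event types seen across all samples.
import collections
import time

WAIT_SAMPLE_SQL = """
SELECT coalesce(wait_event_type, 'Running') AS wait_event_type, count(*)
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
GROUP BY 1
"""

def summarize(samples):
    """Aggregate (wait_event_type, count) rows across all samples."""
    totals = collections.Counter()
    for event_type, count in samples:
        totals[event_type] += count
    return dict(totals)

def sample_wait_events(dsn, seconds=60, interval=1.0):
    # psycopg2 imported here so the pure helper above has no driver dependency
    import psycopg2
    rows = []
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for _ in range(int(seconds / interval)):
            cur.execute(WAIT_SAMPLE_SQL)
            rows.extend(cur.fetchall())
            time.sleep(interval)
    return summarize(rows)
```

A result dominated by `IO` points at disk pressure on the 4 vCore replica, while `Lock`/`LWLock` points at contention.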

CTE Scan on bravo_zulu romeo  (cost=2151.89..2151.94 rows=1 width=204) (actual time=27.756..84.147 rows=36 loops=1)
  CTE bravo_zulu
      ->  Nested Loop  (cost=13.84..2151.89 rows=1 width=139) (actual time=27.744..84.009 rows=36 loops=1)
            ->  Nested Loop  (cost=13.42..2151.43 rows=1 width=139) (actual time=26.811..76.983 rows=36 loops=1)
                  ->  Nested Loop  (cost=12.86..130.51 rows=1 width=44) (actual time=7.471..19.361 rows=29 loops=1)
                        ->  Nested Loop  (cost=4.88..97.73 rows=1 width=24) (actual time=7.410..10.480 rows=24 loops=1)
                              ->  Index Scan using yankee on xray_zulu foxtrot_uniform  (cost=0.28..8.29 rows=1 width=8) (actual time=1.339..1.340 rows=1 loops=1)
                                      Index Cond: ("juliet" = 20)
                              ->  Bitmap Heap Scan on golf_delta hotel_six  (cost=4.60..89.43 rows=1 width=20) (actual time=6.064..9.123 rows=24 loops=1)
                                      Recheck Cond: ("delta_oscar_hotel" = foxtrot_uniform."lima")
                                      Filter: ("juliet" = ANY ('foxtrot_oscar'::integer[]))
                                      Rows Removed by Filter: 442
                                      Heap Blocks: exact=65
                                    ->  Bitmap Index Scan on papa  (cost=0.00..4.60 rows=42 width=0) (actual time=0.024..0.024 rows=466 loops=1)
                                            Index Cond: ("delta_oscar_hotel" = foxtrot_uniform."lima")
                        ->  Bitmap Heap Scan on delta_sierra_two bravo_hotel  (cost=7.98..32.76 rows=2 width=20) (actual time=0.321..0.363 rows=1 loops=24)
                                Recheck Cond: ((hotel_six."juliet" = "xray_india") OR (hotel_six."juliet" = "foxtrot_foxtrot"))
                                Filter: ("hotel_golf" = 23)
                                Rows Removed by Filter: 10
                                Heap Blocks: exact=240
                              ->  BitmapOr  (cost=7.98..7.98 rows=9 width=0) (actual time=0.066..0.066 rows=0 loops=24)
                                    ->  Bitmap Index Scan on delta_sierra_sierra  (cost=0.00..3.99 rows=5 width=0) (actual time=0.063..0.063 rows=11 loops=24)
                                            Index Cond: (hotel_six."juliet" = "xray_india")
                                    ->  Bitmap Index Scan on xray_sierra  (cost=0.00..3.99 rows=4 width=0) (actual time=0.002..0.002 rows=0 loops=24)
                                            Index Cond: (hotel_six."juliet" = "foxtrot_foxtrot")
                  ->  Index Only Scan using echo on xray_papa victor  (cost=0.56..2020.44 rows=48 width=102) (actual time=1.606..1.986 rows=1 loops=29)
                          Index Cond: (("five_lima" = 23) AND ("seven_yankee" = bravo_hotel."november") AND ("charlie_hotel" five_romeo NULL))
                          Filter: (("three" = 'charlie_romeo'::text) AND (("alpha" = 'golf_bravo'::text) OR ("alpha" = 'delta_echo'::text)) AND ((("alpha" = ANY ('mike_juliet'::text[])) AND ("mike_lima" >= 'xray_whiskey'::date) AND ("mike_lima" <= 'uniform'::date)) OR (("alpha" = ANY ('kilo'::text[])) AND ("quebec_uniform" >= 'xray_whiskey'::date) AND ("quebec_uniform" <= 'uniform'::date)) OR (("alpha" = 'quebec_alpha_quebec'::text) AND ("quebec_uniform" >= 'xray_whiskey'::date) AND ("quebec_uniform" <= 'uniform'::date) AND ("mike_lima" >= 'xray_whiskey'::date) AND ("mike_lima" <= 'uniform'::date))) AND ((("alpha" = ANY ('oscar'::text[])) AND ("seven_india" = ANY ('four'::text[]))) OR (("alpha" = ANY ('quebec_alpha_delta'::text[])) AND ("seven_charlie" = ANY ('four'::text[])))))
                          Rows Removed by Filter: 1059
                          Heap Fetches: 0
            ->  Index Scan using bravo_papa on tango sierra  (cost=0.42..0.45 rows=1 width=16) (actual time=0.194..0.194 rows=1 loops=36)
                    Index Cond: (("bravo_two" = 23) AND ("delta_tango" = six1."delta_oscar_romeo"))
  SubPlan
    ->  Result  (cost=0.01..0.02 rows=1 width=32) (actual time=0.001..0.001 rows=1 loops=36)
            One-Time Filter: ((romeo.zulu = 'golf_bravo'::text) AND (romeo.golf_uniform = 20) AND (romeo.charlie_two = 'charlie_romeo'::text))
  SubPlan
    ->  Result  (cost=0.01..0.02 rows=1 width=32) (actual time=0.000..0.000 rows=0 loops=36)
            One-Time Filter: ((romeo.zulu = 'delta_echo'::text) AND (romeo.charlie_two = 'charlie_romeo'::text) AND (romeo.golf_uniform = 20))
Planning time: 19.385 ms
Execution time: 84.373 ms

Above is an anonymised execution plan from a run in isolation; this same query timed out when the functions ran in parallel in Azure.

Table sizes are not large; the largest is 8M rows and all the others are in the low hundreds of thousands.
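For reference, if each function processes many reports, the fan-out can be throttled more tightly than the 5-function limit alone by capping in-flight queries per process. A minimal sketch; the cap of 3 and the `run_all` driver are assumed values for illustration, not from the question:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Cap concurrent queries per worker process. With 5 function instances this
# bounds total replica concurrency at 5 * MAX_INFLIGHT.
MAX_INFLIGHT = 3
_gate = threading.BoundedSemaphore(MAX_INFLIGHT)

def run_with_gate(fn, *args):
    with _gate:  # blocks while MAX_INFLIGHT queries are already running
        return fn(*args)

def run_all(reports, fn, workers=10):
    """Run fn over all reports, but never more than MAX_INFLIGHT at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: run_with_gate(fn, r), reports))
```

Lowering the cap trades wall-clock time for steadier per-query latency, which is often a net win when queries start timing out under contention.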

user351711
  • I'm afraid that without knowing anything about the queries, the size of the dataset, how many tables are joined etc. you won't be able to get anything better than guesses as answers. This is like going to a mechanic and telling them that your sportscar is slow, but you left the car at home so you can only describe it to them. – Karl-Johan Sjögren Mar 21 '21 at 10:09
  • Apologies, I will try to provide the data. – user351711 Mar 21 '21 at 10:11
  • I have added an anonymised execution plan when run in isolation – user351711 Mar 21 '21 at 10:19
  • `when we are running functions on consumption in parallel` <<-- what does this mean ? – wildplasser Mar 21 '21 at 10:41
  • I mean on the consumption plan in Azure, we are restricting to a maximum of 5 functions running at the same time. – user351711 Mar 21 '21 at 10:48
  • So, you are running this same beautiful query 30,000 times? Why? – wildplasser Mar 21 '21 at 11:28
  • Not the exact same query: variations, but very similar. It's dynamically built depending on the customer's options. We have a lot of customers and a lot of unique reports they have configured, which get data based on their sales, dates, etc., and we regenerate these as new transactions come in. All in, we run 30k unique reports a night on average, depending on new transactions and other criteria. I understand it is not ideal, just dealing with the hand I was dealt... – user351711 Mar 21 '21 at 11:52

1 Answer


When analysing a problem like this, I find it useful to restate what we know:

  • Your queries go from taking 100ms to timing out after 5 minutes.
  • This is happening on a 4 vCore system, with at most 5 functions running at the same time.

This sounds more like a deadlock problem than a load problem.

There are two things that you could try:

  • Change to run one report at a time, to see if the timeouts disappear
  • Check your SQL code for any `FOR UPDATE` clauses that could be leading to database locks
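If locks are suspected, PostgreSQL can also report blocked sessions directly while the batch is running. A minimal sketch; the helper names are mine, and it assumes any DB-API driver such as psycopg2:

```python
# Sketch: find sessions that are blocked waiting on a lock, and which
# backend pids are blocking them. pg_blocking_pids() exists from
# PostgreSQL 9.6 onward, so it works on 11.6.
BLOCKED_SQL = """
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
"""

def find_blocked(conn):
    """Run the query on an open DB-API connection and return the raw rows."""
    with conn.cursor() as cur:
        cur.execute(BLOCKED_SQL)
        return cur.fetchall()

def format_blocked(rows):
    """Render (pid, blocked_by, wait_event_type, wait_event, query) rows."""
    return [f"pid {pid} blocked by {blockers}: {query}"
            for pid, blockers, _wet, _we, query in rows]
```

If this returns nothing while queries are slow, the problem is elsewhere (connections, IO, CPU) rather than lock waits.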

Edit:

Based on the answer in the comments, we can assume that it is not a database lock problem.

The next thing to check is the connection to the database. It could be that the system is running out of available connections. Things to check:

  • Max number of connections available
  • Is connection pooling used
  • Are connections being closed / released back to the pool when they are no longer needed
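The check-out / check-in pattern behind the last point can be sketched as below. This is only an illustration of the pattern; in production you would use your driver's pool (e.g. `psycopg2.pool.ThreadedConnectionPool`) or lean on the PgBouncer instance you already have as the single pooling layer:

```python
import contextlib
import queue

class SimplePool:
    """Minimal illustration of a fixed-size connection pool."""

    def __init__(self, factory, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    @contextlib.contextmanager
    def connection(self, timeout=30):
        conn = self._pool.get(timeout=timeout)  # blocks if the pool is exhausted
        try:
            yield conn
        finally:
            self._pool.put(conn)  # always returned, even when the caller raises
```

The key property is the `finally`: a connection leaks permanently if an exception path skips the check-in, and a leaked connection per failed report is exactly how a nightly batch runs a pool dry.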
Shiraz Bhaiji
  • Hi Shiraz, thank you for the response, yes the timeout issue goes away if running one at a time. There are no insert/updates/deletes on any of the tables that are part of this query. I’ll acknowledge it’s a terrible design in a legacy system I’m stuck fixing but basically the data from each query is loaded into memory and transformed and then stored in another system for consumption. We need to be able to scale out as once loaded everything is in memory in each function. I’m no db expert but is it just a case of throughput/IO as each query can load a lot of rows? – user351711 Mar 21 '21 at 17:54