
This SQL query normally takes only a few minutes to run:

update import_parts ip
set part_manufacturer_id = pslc.part_manufacturer_id
from parts.part_supplier_line_codes pslc
where trim(lower(ip.line_code)) = trim(lower(pslc.supplier_line_code))
and (ip.status is null or ip.status != '6')
and ip.distributor_id = pslc.distributor_id
and ip.distributor_id = 196;

But I noticed that it sometimes gets stuck and is auto-canceled by our 2-hour statement_timeout. I also noticed a couple of times that when it gets stuck, autovacuum is running, and that autovacuum run also takes a long time to finish. Here's one instance where the update query and autovacuum were both running and both took too long to finish:

[screenshot: autovacuum and update query running at the same time]

^ In this instance, autovacuum finished in about an hour and the update query finished in almost 2 hours. In other instances, the update query exceeded the 2-hour statement_timeout and was auto-canceled.

Now my question is: is autovacuum (VACUUM) the reason the update query gets stuck or takes hours to finish? If yes, what can I do to prevent the update query from getting stuck or becoming so slow? If no, what do you think is causing it to get stuck or become so slow?

We're using PostgreSQL 9.6.15
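A query along the following lines can be run while the UPDATE appears stuck, to see what (if anything) it is waiting on. This is only a sketch against pg_stat_activity; the wait_event_type and wait_event columns exist from 9.6 onward.

-- show every non-idle backend, its transaction age, and what it is waiting on
select pid,
       state,
       wait_event_type,
       wait_event,
       now() - xact_start as xact_age,
       left(query, 80)    as query_start
from pg_stat_activity
where state <> 'idle'
order by xact_start;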

UPDATE 1

I checked whether our RDS instance is running out of server resources. Our instance size is db.t2.medium (2 vCPUs, 4 GB RAM, 1000 IOPS, provisioned IOPS SSD storage).

Here are the CloudWatch metrics for the last 3 days. Take note that the update query above got stuck multiple times within those 3 days.

[screenshot: CPU utilization]

[screenshot: freeable memory]

[screenshot: write IOPS]

UPDATE 2

The only active locks while the update query and autovacuum were running:

[screenshot: active locks]

^ The locks highlighted in red are held by autovacuum. The locks highlighted in green are held by the update query.

Here's the list of all the database connections while the update query and autovacuum were running:

[screenshot: database connections list]

The one highlighted in red is the autovacuum; the one highlighted in green is the update query.
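For reference, a listing like the ones in these screenshots can be obtained with a query roughly like the following (a sketch that joins pg_locks to pg_stat_activity; the screenshots themselves may have been produced with a different query or tool):

-- current locks of other sessions, with the query that holds or requests them
select l.pid, l.locktype, l.relation::regclass as relation, l.mode, l.granted, a.query
from pg_locks l
join pg_stat_activity a on a.pid = l.pid
where l.pid <> pg_backend_pid()
order by l.pid;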

Here's the result of EXPLAIN for the update query:

[screenshot: result of EXPLAIN]

The parts.part_supplier_line_codes table only contains 2758 rows. No two rows in the table share the same supplier_line_code + distributor_id combination.

import_parts table contains 12.6 million rows.
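The uniqueness claim can be verified with a query along these lines (a sketch; it returns no rows if every trimmed, lower-cased supplier_line_code + distributor_id combination is unique):

select trim(lower(supplier_line_code)) as line_code, distributor_id, count(*)
from parts.part_supplier_line_codes
group by 1, 2
having count(*) > 1;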

UPDATE 3

Here's the result of EXPLAIN (ANALYZE, BUFFERS):

EXPLAIN (ANALYZE, BUFFERS)
update import_parts ip
set part_manufacturer_id = pslc.part_manufacturer_id
from parts.part_supplier_line_codes pslc
where trim(lower(ip.line_code)) = trim(lower(pslc.supplier_line_code))
and (ip.status is null or ip.status != '6')
and ip.distributor_id = pslc.distributor_id
and ip.distributor_id = 196;

Update on import_parts ip  (cost=2967312.95..3778109.36 rows=31167172 width=156) (actual time=151475.198..151475.198 rows=0 loops=1)
  Buffers: shared hit=62369982 read=453357 dirtied=375348 written=315748, temp read=154212 written=307558
  ->  Merge Join  (cost=2967312.95..3778109.36 rows=31167172 width=156) (actual time=37567.148..84208.239 rows=10326988 loops=1)
        Merge Cond: ((btrim(lower((pslc.supplier_line_code)::text))) = (btrim(lower((ip.line_code)::text))))
        Buffers: shared hit=94397 read=78007, temp read=154212 written=307558
        ->  Sort  (cost=51.70..52.93 rows=493 width=17) (actual time=9.649..10.039 rows=494 loops=1)
              Sort Key: (btrim(lower((pslc.supplier_line_code)::text)))
              Sort Method: quicksort  Memory: 63kB
              Buffers: shared hit=7 read=11
              ->  Index Scan using index_part_supplier_line_codes_on_distributor_id on part_supplier_line_codes pslc  (cost=0.28..29.65 rows=493 width=17) (actual time=2.926..8.677 rows=494 loops=1)
                    Index Cond: (distributor_id = 196)
                    Buffers: shared hit=2 read=11
        ->  Materialize  (cost=2967261.25..3030480.67 rows=12643883 width=146) (actual time=37557.491..76400.550 rows=12642995 loops=1)
              Buffers: shared hit=94390 read=77996, temp read=154212 written=307558
              ->  Sort  (cost=2967261.25..2998870.96 rows=12643883 width=146) (actual time=37557.486..68337.525 rows=12642995 loops=1)
                    Sort Key: (btrim(lower((ip.line_code)::text)))
                    Sort Method: external merge  Disk: 1233688kB
                    Buffers: shared hit=94390 read=77996, temp read=154212 written=154212
                    ->  Seq Scan on import_parts ip  (cost=0.00..362044.24 rows=12643883 width=146) (actual time=0.027..11903.240 rows=12643918 loops=1)
                          Filter: (((status IS NULL) OR ((status)::text <> '6'::text)) AND (distributor_id = 196))
                          Buffers: shared hit=94390 read=77996
Planning time: 0.169 ms
Execution time: 151561.250 ms
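Most of the work in this plan is the sort of all 12.6 million import_parts rows, which spills about 1.2 GB to disk (Sort Method: external merge) and accounts for a large share of the temp and shared I/O. One experiment would be an expression index that matches the join key, so the planner has an alternative to sorting the whole table on every run. This is only a sketch (the index name is made up) and its benefit would have to be confirmed with EXPLAIN:

-- expression index on the join key used by the merge condition
create index concurrently import_parts_distributor_line_code_idx
    on import_parts (distributor_id, (trim(lower(line_code))));

Raising work_mem for just this session is another way to keep the sort (or a hash join) in memory instead of on disk.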

UPDATE 4

Given that our AWS RDS instance has a capacity of 1000 IOPS, the read IOPS in the screenshot below look too high. Is it possible that this is the reason the update query sometimes gets stuck and autovacuum takes too long to finish running?

Read IOPS:

[screenshot: read IOPS]

Read Latency:

[screenshot: read latency]

Write Latency:

[screenshot: write latency]

Swap Usage:

[screenshot: swap usage]

Queue Depth:

[screenshot: queue depth]

Result of EXPLAIN (ANALYZE, BUFFERS) after rebooting our AWS RDS instance:

EXPLAIN (ANALYZE, BUFFERS)
update import_parts ip
set part_manufacturer_id = pslc.part_manufacturer_id
from parts.part_supplier_line_codes pslc
where trim(lower(ip.line_code)) = trim(lower(pslc.supplier_line_code))
and (ip.status is null or ip.status != '6')
and ip.distributor_id = pslc.distributor_id
and ip.distributor_id = 196;

Update on import_parts ip  (cost=3111484.57..3919788.11 rows=31071345 width=156) (actual time=180680.200..180680.200 rows=0 loops=1)
  Buffers: shared hit=62263174 read=712018 dirtied=386277 written=223564, temp read=237087 written=390433
  ->  Merge Join  (cost=3111484.57..3919788.11 rows=31071345 width=156) (actual time=64687.806..112959.396 rows=10326988 loops=1)
        Merge Cond: ((btrim(lower((pslc.supplier_line_code)::text))) = (btrim(lower((ip.line_code)::text))))
        Buffers: shared hit=5 read=325434, temp read=237087 written=390433
        ->  Sort  (cost=58.61..59.85 rows=493 width=17) (actual time=4.238..5.549 rows=494 loops=1)
              Sort Key: (btrim(lower((pslc.supplier_line_code)::text)))
              Sort Method: quicksort  Memory: 63kB
              Buffers: shared hit=5 read=11
              ->  Bitmap Heap Scan on part_supplier_line_codes pslc  (cost=7.40..36.56 rows=493 width=17) (actual time=2.582..3.242 rows=494 loops=1)
                    Recheck Cond: (distributor_id = 196)
                    Heap Blocks: exact=7
                    Buffers: shared read=11
                    ->  Bitmap Index Scan on index_part_supplier_line_codes_on_distributor_id  (cost=0.00..7.28 rows=493 width=0) (actual time=1.805..1.805 rows=494 loops=1)
                          Index Cond: (distributor_id = 196)
                          Buffers: shared read=4
        ->  Materialize  (cost=3111425.95..3174450.99 rows=12605008 width=146) (actual time=64683.559..105123.024 rows=12642995 loops=1)
              Buffers: shared read=325423, temp read=237087 written=390433
              ->  Sort  (cost=3111425.95..3142938.47 rows=12605008 width=146) (actual time=64683.554..96764.494 rows=12642995 loops=1)
                    Sort Key: (btrim(lower((ip.line_code)::text)))
                    Sort Method: external merge  Disk: 1233528kB
                    Buffers: shared read=325423, temp read=237087 written=237087
                    ->  Seq Scan on import_parts ip  (cost=0.00..514498.12 rows=12605008 width=146) (actual time=0.748..36768.509 rows=12643918 loops=1)
                          Filter: (((status IS NULL) OR ((status)::text <> '6'::text)) AND (distributor_id = 196))
                          Buffers: shared read=325423
Planning time: 23.127 ms
Execution time: 180803.124 ms

I rebooted the RDS instance to clear PostgreSQL's cache and see whether there's a caching issue. I read somewhere that rebooting PostgreSQL clears the database's cache.

John Kenn
  • What's the execution plan? Are you sure you are not hitting the same problem as in your other question? https://stackoverflow.com/q/62493528/2588800 – Svetlin Zarev Jun 24 '20 at 14:41
  • Do you run this with the same params of '6' and 196 each time? It seems like you should add `AND part_manufacturer_id is distinct from pslc.part_manufacturer_id` to avoid updating the same rows over and over again. Do the slow ones update the same number of rows as the fast ones? – jjanes Jun 24 '20 at 14:45
  • @SvetlinZarev I updated my post to show the execution plan. I think I'm not hitting the same issue because on this case, there are no 2 or more rows from `parts.part_supplier_line_codes` table that has the same `supplier_line_code + distributor_id` combination. – John Kenn Jun 26 '20 at 00:35
  • @jjanes Yes, same params of '6' and 196 each time. Do you think `AND part_manufacturer_id is distinct from pslc.part_manufacturer_id` is still needed even though there are no 2 or more rows from the `parts.part_supplier_line_codes` table that have the same `supplier_line_code + distributor_id` combination? Not exactly, but the slow/stuck ones update roughly the same number of rows as the fast ones. – John Kenn Jun 26 '20 at 00:41
  • RDS instances are impossible to tune due to the inconsistent IO speeds (based on personal experience). I found we got much MUCH better performance by running our own EC2 instances with NVMe drives for blazing fast IO. – maxTrialfire Jul 07 '22 at 23:43
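For reference, jjanes' suggestion in the comments would add one more predicate to the statement, so that rows whose part_manufacturer_id already has the right value are skipped on reruns (a sketch of the same query with that predicate appended):

update import_parts ip
set part_manufacturer_id = pslc.part_manufacturer_id
from parts.part_supplier_line_codes pslc
where trim(lower(ip.line_code)) = trim(lower(pslc.supplier_line_code))
and (ip.status is null or ip.status != '6')
and ip.distributor_id = pslc.distributor_id
and ip.distributor_id = 196
and ip.part_manufacturer_id is distinct from pslc.part_manufacturer_id;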

1 Answer


Autovacuum will never block an UPDATE and vice versa. That is a fundamental design principle of VACUUM, otherwise PostgreSQL wouldn't work very well.

The only effect that autovacuum could have on the progress of an UPDATE is via the shared resources, most likely I/O. VACUUM causes I/O load, and it uses memory and CPU power. So you might want to check if any of these resources are running short when autovacuum runs. If yes, the answer would be to move to stronger hardware: it is a bad idea to slow down autovacuum because the machine is too lame; that will lead to bloat and other problems in the long run.
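One way to check what autovacuum is doing while the UPDATE runs is something along these lines (a sketch; pg_stat_progress_vacuum is available from 9.6 onward):

-- progress of any VACUUM currently running, including autovacuum
select * from pg_stat_progress_vacuum;

-- autovacuum workers and how long they have been running
select pid, now() - xact_start as running_for, query
from pg_stat_activity
where query like 'autovacuum:%';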

Laurenz Albe
  • I updated my post to include details about our server resources and usage metrics. – John Kenn Jun 24 '20 at 23:33
  • Very likely a row or table lock is at fault. Check for long running transactions. – Laurenz Albe Jun 25 '20 at 05:30
  • The only active locks during that time were the locks created by the update query and the autovacuum. And based on my research and what you said, their locks don't conflict with each other. And they are the only active queries (update and autovacuum) at that time too. That's why I thought the autovacuum was the culprit. If server resources and the locks of the update query and autovacuum are not the problem, any other ideas on what might be causing the issue? – John Kenn Jun 25 '20 at 08:58
  • Do you see a `wait_event` in `pg_stat_activity`? – Laurenz Albe Jun 25 '20 at 10:02
  • I updated my post to provide more details. In the screenshot of the db connections, you'll see that there's only 2 active connections (the update query and the autovacuum). And the wait_event column are all blank. – John Kenn Jun 26 '20 at 00:33
  • That query looks like it touches a lot of blocks, so I'd suspect caching issues. An `EXPLAIN (ANALYZE, BUFFERS)` (as text, not screen shot) would be instructive, particularly from a slow execution. Looks like a hosted database, so you probably cannot use `auto_explain`. – Laurenz Albe Jun 26 '20 at 01:59
  • I updated my post to show the result of `EXPLAIN (ANALYZE, BUFFERS)`. – John Kenn Jun 26 '20 at 07:28
  • The query has `shared hit=62369982 read=453357` blocks, and this is a fast run. If for some reason blocks drop out of the cache, that could explain the long duration. It need not be the `VACUUM` run, it could be something earlier that *caused* autovacuum to run. – Laurenz Albe Jun 26 '20 at 08:30
  • I guess to troubleshoot it, I could clear the cache and run the update query using EXPLAIN (ANALYZE, BUFFERS) to see if query will get stuck / take too long to run? Would configuring autovacuum to run more frequently help if ever it's really a caching issue? – John Kenn Jun 26 '20 at 12:49
  • I can't say, don't think so. If it is really correlated with autovacuum, I'd still assume that resource contention (I/O?) is the most likely suspect. – Laurenz Albe Jun 26 '20 at 13:00
  • Ok. After I troubleshoot if it's a caching issue or not, i'll try increasing our write iops capacity and see if it will fix the issue. – John Kenn Jun 27 '20 at 04:30
  • I updated my post to include more db server metrics like read iops, read/write latency, swap usage, etc... and the result of `EXPLAIN (ANALYZE, BUFFERS)` after restarting the AWS RDS instance to clear PostgreSQL's cache. Read iops looks too high given that our db only has 1000 IOPS capacity. Is it possible that it's the one that is causing the issues we're encountering? – John Kenn Jun 27 '20 at 13:24
  • Looks like I/O is the problem, yes. – Laurenz Albe Jun 29 '20 at 05:31
  • After upgrading our RDS instance size from t2.medium to t2.xlarge and increasing provisioned IOPS from 1000 to 4000, the issue still happens considering t2.xlarge has a more powerful CPU and has 16GB RAM compared to t2.medium which only has 4GB RAM. We also tried switching from t2.xlarge with 4000 provisioned IOPS to t3.large (newer generation of "T" instance family of AWS) with 1000 provisioned IOPS and upgrading PostgreSQL from 9.6 to 12.3 and the issue still happens. – John Kenn Jul 21 '20 at 09:52
  • We also already tried running autovacuum more frequently and tuning PostgreSQL by using this app as a guide - https://pgtune.leopard.in.ua/#/ but the issue still happens. Any other ideas on what's possibly causing the issue? – John Kenn Jul 21 '20 at 09:52
  • Is there a possibility that lack of server resources is not the problem? – John Kenn Aug 05 '20 at 23:16
  • Then I/O is probably not the problem, and you have something else (not autovacuum!) blocking the statement. Examine `pg_locks` when you see a hanging statement. – Laurenz Albe Aug 10 '20 at 05:56
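A sketch of such a check using pg_blocking_pids() (available from 9.6), which consults the lock manager and lists, for each waiting backend, the sessions that block it:

-- pairs of blocked and blocking sessions; returns no rows if nothing is blocked
select blocked.pid    as blocked_pid,
       blocked.query  as blocked_query,
       blocking.pid   as blocking_pid,
       blocking.query as blocking_query
from pg_stat_activity as blocked
join lateral unnest(pg_blocking_pids(blocked.pid)) as b(pid) on true
join pg_stat_activity as blocking on blocking.pid = b.pid;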