We have a PostgreSQL 12.7 database with multiple schemas, each schema assigned to a specific test environment.

As part of our data model evolution we have an `ALTER TABLE my_table DROP COLUMN my_column` statement that needs to be executed via the Flyway DB migration tool. Our Spring Boot application failed to start because Flyway failed with a timeout.

Connecting to the database as a DBA, I tried to run that DROP COLUMN statement manually, with no result: it just kept running. The table is very small, with just 112 rows, so you would expect the DROP statement to execute instantly. We had a very similar problem before with a DROP TABLE statement on a different PostgreSQL database (same version), and the only workaround was to destroy the database and recreate it in AWS. We cannot do the same with this database instance because multiple test environments are connected to it; it would impact everyone. I tried dropping the schema for that particular environment with `DROP SCHEMA my_schema CASCADE`, and again I had to cancel after waiting for more than 10 minutes.
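
To make a hanging DDL statement like this fail fast instead of blocking a migration, one option (a minimal sketch for a manual psql session; the 5-second value is an arbitrary choice) is to set a lock timeout first:

SET lock_timeout = '5s';
-- With the timeout set, the DDL aborts with an error if it cannot
-- acquire its lock in time, instead of waiting indefinitely.
ALTER TABLE my_table DROP COLUMN my_column;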

On the other hand, my feeling is that it is just a matter of time before this happens in PROD, so we need to find a way to resolve this type of issue.

Any help would be greatly appreciated.

** UPDATE ** I ran the suggested query:

SELECT pid,
       usename,
       pg_blocking_pids(pid) as blocked_by
  FROM pg_stat_activity
 WHERE cardinality(pg_blocking_pids(pid)) > 0
   AND query ='ALTER TABLE my_table DROP COLUMN my_column;';

and it came up with this result:

  pid  | usename  |               blocked_by
-------+----------+-----------------------------------------
 29688 | dba_root | {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
(1 row)

If I cancel the `ALTER TABLE ...`, the above query returns nothing.

Julian

2 Answers

Since the blocking process IDs show up as 0, they may be prepared transactions. Look into pg_prepared_xacts to see if there are any prepared transactions that a buggy application failed to clean up.
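
For concreteness, a minimal inspection sketch (these are standard columns of the pg_prepared_xacts system view):

SELECT gid,       -- global transaction identifier
       prepared,  -- when the transaction was prepared
       owner,
       database
  FROM pg_prepared_xacts
 ORDER BY prepared;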

Laurenz Albe
  • You are right. Running `SELECT * FROM pg_prepared_xacts;` brings up 20 rows, even now when the application is stopped and I am the only one connected to the database. How can I clean them up and see if it makes a difference? – Julian Mar 29 '23 at 23:28
  • `ROLLBACK PREPARED` rolled back all those transactions, and after that I was able to alter the table. Interestingly, those 20 prepared transactions were quite old, dating from 28/02/2024, and the app was working with no issues. What I don't understand is how a database restart did not clean those up. Anyway, learned something new. Thanks so much!!! What a relief!!! – Julian Mar 30 '23 at 01:22
  • It is the purpose in life of a prepared transaction to survive anything, including a crash. Therefore, they are dangerous. You should not use them unless you have a reliable, crash-safe distributed transaction coordinator, which you don't. At least add monitoring for stale prepared transactions (a sketch follows this thread). – Laurenz Albe Mar 30 '23 at 05:48
  • So idle_in_transaction_session_timeout set to 1 day does not apply in this case. We are using the Atomikos Transaction Manager, which, according to this link https://stackoverflow.com/questions/50511618/why-atomikos-regularly-pinged-pg-prepared-xacts, should do the monitoring (and it seems it does). But in this particular case it seems Atomikos could not match the contents of pg_prepared_xacts against the contents of its logs. – Julian Mar 30 '23 at 06:25
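
A minimal monitoring sketch along the lines suggested above (the one-day threshold is an arbitrary example, not something from the discussion), together with a query that generates the per-transaction cleanup statements:

-- Flag prepared transactions older than one day.
SELECT gid, prepared, owner, database,
       now() - prepared AS age
  FROM pg_prepared_xacts
 WHERE prepared < now() - interval '1 day';

-- Generate one ROLLBACK PREPARED statement per transaction;
-- review the output before executing it.
SELECT format('ROLLBACK PREPARED %L;', gid)
  FROM pg_prepared_xacts;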

You can try to run this query:

SELECT blocked_locks.pid     AS blocked_pid,
         blocked_activity.usename  AS blocked_user,
         blocking_locks.pid     AS blocking_pid,
         blocking_activity.usename AS blocking_user,
         blocked_activity.query    AS blocked_statement,
         blocking_activity.query   AS current_statement_in_blocking_process
   FROM  pg_catalog.pg_locks         blocked_locks
    JOIN pg_catalog.pg_stat_activity blocked_activity  ON blocked_activity.pid = blocked_locks.pid
    JOIN pg_catalog.pg_locks         blocking_locks
        ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
        AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
        AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
        AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
        AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
        AND blocking_locks.pid != blocked_locks.pid
    JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
   WHERE NOT blocked_locks.granted;
Lev Gelman
  • I ran your query and I can see my `DROP COLUMN` and another `SELECT ... FOR NO KEY UPDATE SKIP LOCKED` locking each other. As soon as I kill the PID of the `SELECT ...` and run your query again, another `SELECT ... FOR NO KEY UPDATE SKIP LOCKED` with a different PID shows up in my result set. I used `SELECT pg_terminate_backend(blocking pid);` to kill the blocking process. Cancelling the `DROP COLUMN` statement makes your query return nothing. – Julian Mar 29 '23 at 23:25
  • It sounds like there are transactions in the database that are using the same table and locking it. – Lev Gelman Mar 30 '23 at 05:21
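
For reference, a sketch of the two ways to get a blocking session out of the way (12345 is a placeholder pid; both calls generally require superuser rights or membership in the pg_signal_backend role):

-- Cancel only the blocker's current query; the session stays connected.
SELECT pg_cancel_backend(12345);

-- Terminate the whole backend session, as was done in the comment above.
SELECT pg_terminate_backend(12345);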