I have a table which has an id and a lock column.
create table mytable (id varchar(255), lock varchar(255), primary key (id));
create index idx_mytable_lock on mytable (lock);
I populated it with 1 million records where only the id column gets a value, each a unique UUID, so the lock column starts out NULL.
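Roughly, the starting state can be reproduced like this (a sketch, not my exact load script; it assumes the gen_random_uuid() function that is built in since PostgreSQL 13):

-- 1 million rows with unique UUID ids, lock left NULL
INSERT INTO mytable (id)
SELECT gen_random_uuid()::varchar
FROM generate_series(1, 1000000);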
I have 8 processes that each try to update the lock column to a unique value. The idea is that these processes are consumers of this table and want to do something per id. This is the update statement:
UPDATE mytable t
SET lock = ?
FROM (SELECT id
      FROM mytable
      WHERE lock IS NULL
      FOR UPDATE SKIP LOCKED
      LIMIT 500) AS subq
WHERE t.id = subq.id
  AND t.lock IS NULL
RETURNING t.id
This is executed from Java via JDBC; the ? in the statement is a parameter set to an integer that acts as the thread id. I'm not using explicit transactions, so every statement I execute is auto-committed.
If I run the 8 threads, they start running and each updates 500 records at a time until all records are updated and have an integer value in the lock column. No problems.
Now I want to retry the whole thing, so I run an update to set the lock column back to null:
UPDATE mytable set lock = null where lock is not null;
and after that I restart the 8 processes. What I now see is that the threads are hanging.
I'm querying the PostgreSQL database processes:
select
    pid,
    client_addr,
    state,
    query_start,
    now() - query_start as since,
    query
from pg_stat_activity
where state <> 'idle'
order by 5 desc;
Output:
pid | client_addr | state | query_start | since | query
-------+-------------+--------+-------------------------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------
19058 | 10.8.4.35 | active | 2023-05-08 12:24:41.333687+00 | 00:08:44.361053 | UPDATE +
| | | | | mytable fs +
| | | | | SET LOCK = $1 +
| | | | | FROM (SELECT id +
| | | | | FROM mytable +
| | | | | WHERE LOCK IS NULL +
| | | | | FOR UPDATE SKIP LOCKED +
| | | | | LIMIT 500 +
| | | | | ) AS subq +
| | | | | WHERE fs.id = subq.id +
| | | | | AND fs.lock is null +
| | | | | returning fs.id +
| | | | |
19057 | 10.8.4.35 | active | 2023-05-08 12:24:41.334555+00 | 00:08:44.360185 | UPDATE +
| | | | | mytable fs +
| | | | | SET LOCK = $1 +
| | | | | FROM (SELECT id +
| | | | | FROM mytable +
| | | | | WHERE LOCK IS NULL +
| | | | | FOR UPDATE SKIP LOCKED +
| | | | | LIMIT 500 +
| | | | | ) AS subq +
| | | | | WHERE fs.id = subq.id +
| | | | | AND fs.lock is null +
| | | | | returning fs.id +
.
.
.
This is repeated 8 times, so I can see the processes are running. It stays like this for at least 30 minutes (I stopped them after that).
My thoughts: there is some locking and blocking going on, so I ran:
SELECT
blockeda.pid AS blocked_pid,
blockeda.state as bd_state,
now() - blockeda.query_start AS waiting,
blockinga.pid AS blocking_pid,
blockinga.state as bi_state
FROM
pg_catalog.pg_locks blockedl
JOIN pg_stat_activity blockeda ON blockedl.pid = blockeda.pid
JOIN pg_catalog.pg_locks blockingl ON(
(
(
blockingl.transactionid = blockedl.transactionid
)
OR (
blockingl.relation = blockedl.relation
AND blockingl.locktype = blockedl.locktype
)
)
AND blockedl.pid != blockingl.pid
)
JOIN pg_stat_activity blockinga ON blockingl.pid = blockinga.pid
AND blockinga.datid = blockeda.datid
WHERE
NOT blockedl.granted
AND blockinga.datname = current_database();
(taken from https://wiki.postgresql.org/wiki/Lock_Monitoring)
But it does not produce any output, which means no process is waiting for a lock. So what is it hanging on? I then checked the PostgreSQL server and it is very busy; it is consuming CPU on all threads, so it does not seem to be waiting for anything.
Could it be that it is hanging on some internal resource?
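One check I could still do (a sketch; pg_stat_activity has had wait_event_type and wait_event columns since 9.6) is to look at what, if anything, the backends report they are waiting on:

select pid, state, wait_event_type, wait_event, now() - query_start as since
from pg_stat_activity
where state <> 'idle'
order by since desc;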
I'm using a PostgreSQL 13.10 database on AWS with 4 GB of RAM. I checked the table stats and the table uses far less than that, so it does not seem to be a memory limitation.
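(By the table stats I mean something like the following check; the exact numbers don't matter, only that the table plus its indexes are well below the available RAM.)

select pg_size_pretty(pg_relation_size('mytable'))       as table_size,
       pg_size_pretty(pg_indexes_size('mytable'))        as index_size,
       pg_size_pretty(pg_total_relation_size('mytable')) as total_size;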
Here my knowledge stops. How is it possible that this is happening? If I retry the statement with a lower number, a limit of 100 instead of 500, then it seems to work every time. It also does not always happen: I retried with a limit of 1000 as well, and that sometimes works and sometimes does not.
What does seem to be a clear trigger is the update statement that resets all records to lock = null. If the lock column is already null, everything works fine. But if I let the 8 threads update all records with an actual lock value, then clear all of them and restart the processes, it starts hanging again.
Hope someone has an explanation for this.