We're running a web application with around 10k active users on average, served by 6 web nodes and backed by a PostgreSQL 9.4.6 database.
Our monitoring tool identified the slow-running query below, which often has unacceptable response times and has occasionally caused outages over the past few days.
The table involved is small; it is basically an in-house sequence implementation (legacy application) that keeps track of unique IDs for the other tables:
CREATE TABLE ids_for_records
(
tableid integer NOT NULL,
id bigint NOT NULL,
CONSTRAINT ids_for_records_pk PRIMARY KEY (tableid)
)
WITH (
OIDS=FALSE
);
This table holds only around 200 records. Our web app nodes use this query to reserve a batch of IDs for their exclusive use:
UPDATE ids_for_records
SET id = id + <batchsize>
WHERE tableid = <unique-internal-table-id>
RETURNING id;
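For illustration only (the batch size and table id here are made-up values), a node reserving 100 IDs for the table with internal id 42 would run:

-- Reserve a batch of 100 IDs for tableid 42 in a single round trip.
-- The values 100 and 42 are hypothetical examples.
UPDATE ids_for_records
SET id = id + 100
WHERE tableid = 42
RETURNING id;

and would then presumably hand out the IDs in the range (returned id - 100, returned id] locally, without further database round trips.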
I need to find out why the performance of this query has dropped significantly in the past few days: it now averages around 1 second, but sometimes takes 30-60 seconds. During high load, all nodes run the exact same query on multiple connections in parallel.
UPDATE: Querying lock information (from pg_locks, pg_class and pg_stat_activity) for a single slow query showed that it is waiting on the exact same query issued by other transactions. So we have simultaneous transactions trying to update (increment the ID value of) the same row, and the one transaction that holds the necessary locks blocks all the others.
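For reference, the diagnostic query was along the lines of the well-known lock-monitoring example from the PostgreSQL wiki; the sketch below (a reconstruction, not my exact query) pairs each waiting backend with the backend holding the conflicting lock:

-- Pair each backend waiting for a lock with the backend holding the conflicting lock.
-- Works on 9.4; pg_blocking_pids() is only available from 9.6 onwards.
SELECT blocked_activity.pid    AS blocked_pid,
       blocked_activity.query  AS blocked_query,
       blocking_activity.pid   AS blocking_pid,
       blocking_activity.query AS blocking_query,
       blocked_class.relname   AS locked_relation
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
  ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
  ON blocking_locks.locktype = blocked_locks.locktype
 AND blocking_locks.database      IS NOT DISTINCT FROM blocked_locks.database
 AND blocking_locks.relation      IS NOT DISTINCT FROM blocked_locks.relation
 AND blocking_locks.page          IS NOT DISTINCT FROM blocked_locks.page
 AND blocking_locks.tuple         IS NOT DISTINCT FROM blocked_locks.tuple
 AND blocking_locks.virtualxid    IS NOT DISTINCT FROM blocked_locks.virtualxid
 AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
 AND blocking_locks.classid       IS NOT DISTINCT FROM blocked_locks.classid
 AND blocking_locks.objid         IS NOT DISTINCT FROM blocked_locks.objid
 AND blocking_locks.objsubid      IS NOT DISTINCT FROM blocked_locks.objsubid
 AND blocking_locks.pid <> blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
  ON blocking_activity.pid = blocking_locks.pid
LEFT JOIN pg_catalog.pg_class blocked_class
  ON blocked_class.oid = blocked_locks.relation
WHERE NOT blocked_locks.granted;

The blocking_query column consistently showed the same UPDATE on ids_for_records, just from a different transaction.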
The database is otherwise healthy: our operations team found no issues at all with storage, memory or connections. However, another table recently reached 64 GB in size, which might be related.
Does anyone have any idea what could cause such a performance degradation? The load is the same as before, but this bottleneck query is now ~5× slower.