
We're running a web application with around 10k active users on average, served by 6 web nodes and backed by PostgreSQL 9.4.6.

Our monitoring tool identified the slow-running query below, which often has unacceptable response times and has occasionally caused outages in the past few days.

This is a small table, basically an in-house sequence implementation (legacy application) that keeps track of unique IDs for the other tables:

CREATE TABLE ids_for_records
(
  tableid integer NOT NULL,
  id bigint NOT NULL,
  CONSTRAINT ids_for_records_pk PRIMARY KEY (tableid)
)
WITH (
  OIDS=FALSE
);

The table has only around 200 records. Our web app nodes use this query to reserve a batch of IDs for their exclusive use:

UPDATE ids_for_records
SET id = id + <batchsize>
WHERE tableid = <unique-internal-table-id>
RETURNING id;

I need to find out why the performance of the above query has dropped significantly in the past few days: it takes around 1 second on average, but sometimes 30-60 seconds. During high load, all nodes run the exact same query on multiple connections in parallel.

UPDATE: Querying lock information (from pg_locks, pg_class and pg_stat_activity) for a single slow query showed that what it waits on is the exact same query from other transactions. So we have simultaneous transactions trying to update (increment the ID value of) the same row, and the one holding the row lock blocks all the others.
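For reference, a lock-monitoring query along these lines (adapted from the standard PostgreSQL wiki recipe; the exact joins are an assumption, and on 9.4 `pg_blocking_pids()` is not yet available) shows which PID blocks which:

```sql
-- Match each waiting lock request to the granted lock on the same object.
SELECT blocked.pid        AS blocked_pid,
       blocked_act.query  AS blocked_query,
       blocking.pid       AS blocking_pid,
       blocking_act.query AS blocking_query
FROM pg_locks blocked
JOIN pg_locks blocking
  ON blocking.locktype = blocked.locktype
 AND blocking.database      IS NOT DISTINCT FROM blocked.database
 AND blocking.relation      IS NOT DISTINCT FROM blocked.relation
 AND blocking.page          IS NOT DISTINCT FROM blocked.page
 AND blocking.tuple         IS NOT DISTINCT FROM blocked.tuple
 AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
 AND blocking.pid <> blocked.pid
JOIN pg_stat_activity blocked_act  ON blocked_act.pid  = blocked.pid
JOIN pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
WHERE NOT blocked.granted   -- the waiter
  AND blocking.granted;     -- the holder
```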

The database is healthy otherwise; our operations team found no issues with storage, memory or connections. However, another table recently reached 64 GB in size, which might be related.

Does anyone have any idea what could cause such a performance degradation? The load is the same as before, but this bottleneck query is ~5× slower.

skarfiol
  • What is a `batch of IDs`? Are you performing row-at-a-time processing from your application? – wildplasser Feb 27 '19 at 22:27
  • Each of the data tables has pre-defined batchSizes in our application, which are being used when retrieving the IDs. Once the above query returns, the webapp node can exclusively use that retrieved range (Retval - batchSize) for assigning new IDs. Then most of the time these are used for individual, per-record INSERTs. – skarfiol Feb 28 '19 at 13:13
  • Do other tables have Foreign Keys to this `id` field? Do they have indexes supporting these FKs? Are there any triggers involved? – wildplasser Feb 28 '19 at 19:21
  • There are no Foreign keys set to the id field, so no indexes either. No triggers involved either. – skarfiol Mar 01 '19 at 13:11

1 Answer


The following contains some guesswork, please correct me where I guessed wrong.

The query itself will be fast, unless something very strange is going on. What makes it slow is waiting for the row lock.

Locks are held for the whole duration of the transaction, so it is probably the processing of the batch in the same transaction as the UPDATE statement that is blocking concurrent sessions.
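To illustrate the guess: if the application runs something like the following, the row lock taken by the UPDATE is held until COMMIT, so every other node updating the same `tableid` waits for the whole transaction. (A sketch; the surrounding work in the transaction is an assumption.)

```sql
BEGIN;

UPDATE ids_for_records
SET id = id + 1000
WHERE tableid = 42       -- example values
RETURNING id;            -- row lock on tableid 42 acquired here

-- ... any further statements in the same transaction (assumed) ...
-- the row stays locked the entire time

COMMIT;                  -- lock released only here
```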

The fix is to use sequences. Since you built your ID generation in a modular fashion around one central function, it should not be too hard to change.

The challenge is to get whole batches of sequence values. You can do that safely by protecting a setval with advisory locks, which can be released before the transaction ends. See my blog post for an example of how that can be done.
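A sketch of that approach (the sequence name and advisory-lock key are made up for the example): reserve a block of values under an advisory lock, then release the lock immediately instead of holding it until commit.

```sql
-- One sequence per data table replaces the row in ids_for_records.
CREATE SEQUENCE orders_id_seq;   -- name assumed

-- Reserve a batch of 1000 IDs without holding a row lock:
SELECT pg_advisory_lock(4711);   -- arbitrary application-chosen key
SELECT setval('orders_id_seq',
              nextval('orders_id_seq') + 999) AS last_id_in_batch;
SELECT pg_advisory_unlock(4711); -- released right away,
                                 -- not at transaction end
```

The point of the advisory lock is that, unlike the row lock in the UPDATE approach, it can be released as soon as the batch is reserved, so concurrent sessions only serialize for the duration of these three statements.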

Laurenz Albe
  • I'm afraid something very strange is going on also :) I've updated the post with information on locks. We execute the above UPDATE query solely in its own transaction, just to retrieve a number that the web apps can use when assigning IDs to future records. Maybe my terminology of 'batch' was misleading; I meant a logical ID range, not batched INSERTs in the same transaction. I hope that clarifies it. – skarfiol Feb 28 '19 at 13:25
  • Also thank you for the suggestion, we've also found that sequences would be better for this purpose, but for now we can't modify / redesign the code, it's an issue in production which we'd like to fix quickly. – skarfiol Feb 28 '19 at 13:30
  • 2
    All you can see from that locks query is that the process is blocked by somebody else - you conveniently didn't include the column that shows *who* is blocking. – Laurenz Albe Feb 28 '19 at 14:00
  • Updated the post with the blocking information - it turned out to be the same query. – skarfiol Mar 01 '19 at 13:17
  • So my guess was correct. Hard to say what makes the problem surface all of a sudden - maybe more concurrency or slower operations elsewhere (like a longer query in the same transaction that takes longer because the tables have grown), so that locks are held longer. Beyond a certain threshold, things escalate. Avoid long transactions and locks. – Laurenz Albe Mar 01 '19 at 13:31