
Today we encountered a performance problem with our PostgreSQL 15 cluster. A massive performance drop suddenly started to occur without any changes to the database servers, the client behaviour, or any software updates. The majority of the traffic to this cluster comes from data processing clients connecting to the database via the Ruby library Sidekiq. The cluster consists of 1 primary and a couple of streaming replicas running on dedicated hardware (Dell machines with 400 GB RAM, 64-core Intel Xeon CPUs, 10 SSDs in hardware RAID, etc.).

When monitoring the active queries with select age(now(), query_start), * from pg_stat_activity order by query_start; we can see that, from one moment to the next, PostgreSQL starts to report waits on MultiXactOffsetSLRU and MultiXactOffsetBuffer for almost all queries, causing hundreds of queries to pile up as active and effectively bringing our data processing to a standstill. The queries stay like that until they time out after 15 minutes (which is our statement_timeout for these clients).
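
To quantify the pile-up, an aggregation over pg_stat_activity like the following (a sketch using only standard columns) shows how many active backends are stuck on each wait event:

    -- Count active backends per wait event
    SELECT wait_event_type, wait_event, count(*) AS backends
    FROM pg_stat_activity
    WHERE state = 'active'
    GROUP BY wait_event_type, wait_event
    ORDER BY backends DESC;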

A few observations:

  • The issue seems to happen only for queries against one specific table of the database (cars)
  • It happens to all kinds of queries to that table - even the simplest SELECT * FROM cars WHERE id = 123 LIMIT 1 (which usually takes < 1 ms).
  • No vacuuming or similar maintenance is running at the time
  • The issue happens even at much lower client throughput than what we usually handle just fine
  • We are seeing this for queries to that table on BOTH the primary AND the replicas, which leads me to think that there could be something wrong with the data directory rather than e.g. the server hardware... Could that really be!? (See the quick check below this list.)
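
One thing we want to check next is the multixact and transaction ID age of that table (a sketch; mxid_age(), relminmxid and relfrozenxid are standard catalog functions/columns, not anything specific to our setup):

    -- Multixact and xid age for the affected table
    SELECT relname,
           mxid_age(relminmxid) AS multixact_age,
           age(relfrozenxid)    AS xid_age
    FROM pg_class
    WHERE relname = 'cars';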

What we have tried:

  • Restart the connecting clients
  • Restart PostgreSQL
  • Reboot the servers
  • Fail over PostgreSQL to see if the problem was server-specific (it is not)

Questions:

  1. How can we further debug/investigate this? (A sketch of what we look at is below the questions.)
  2. What do these MultiXactOffsetSLRU and MultiXactOffsetBuffer wait events mean? We don't recall ever seeing them before.
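
For question 1, a variant of the monitoring query above that we could use to spot long-open transactions and likely suspects (a sketch; only standard pg_stat_activity columns):

    -- Oldest open transactions first; candidates for whatever is holding things up
    SELECT pid, state, wait_event_type, wait_event,
           age(now(), xact_start) AS xact_age,
           left(query, 60) AS query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_start
    LIMIT 20;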

Example from pg_stat_activity:

age              | 00:04:19.050777
datid            | 16391
datname          | production
pid              | 2797495
leader_pid       | 
usesysid         | 16392
usename          | au
application_name | sidekiq
client_addr      | 10.10.30.12
client_hostname  | 
client_port      | 50006
backend_start    | 2023-05-01 19:30:30.85287+00
xact_start       | 
query_start      | 2023-05-01 19:30:31.486781+00
state_change     | 2023-05-01 19:30:31.610657+00
wait_event_type  | LWLock
wait_event       | MultiXactOffsetSLRU
state            | active
backend_xid      | 
backend_xmin     | 
query_id         | 
query            | SELECT COUNT(*) FROM "cars" WHERE "cars"."id" = $1 LIMIT 1
backend_type     | client backend

Example from pg_activity tool when the issue occurs:

[screenshot: pg_activity output during the incident]

Niels Kristian
  • Multixact contention is often caused by foreign keys. Usually that problem only manifests if you have too many client connections. Reduce the size of your connection pool, and the problem will go away. – Laurenz Albe May 02 '23 at 04:20
  • Thanks, however, I have been running with this number of clients for months and years, and it usually goes just fine - even though I know it's on the high side. Any idea what might cause such a sudden, drastic performance drop and the appearance of these Multixact problems? – Niels Kristian May 02 '23 at 04:27
  • A change in your workload. The fact that it has been working well for a while means little. Often you just have to cross a certain threshold to get where you are, since the problem is self-propagating: waiting for the lock means a longer duration, which will make others wait for the lock even longer, and so on. – Laurenz Albe May 02 '23 at 05:55
  • Thanks again. Am I right, that you don't see any risk, that this could be caused by data corruption on the `cars` table etc. in any way? – Niels Kristian May 02 '23 at 07:26
  • This has nothing to do with data corruption. Data corruption would manifest itself by ugly error messages in the log. – Laurenz Albe May 02 '23 at 08:06
  • Do you have any prepared transactions? Those will survive a reboot. – jjanes May 02 '23 at 15:35
  • There is likely something holding this LWLock while blocking on something else. There is no way to tell directly what that might be, but look through pg_stat_activity for likely candidates. – jjanes May 02 '23 at 15:59
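
A quick way to check the prepared-transactions suggestion from the comments (a sketch; pg_prepared_xacts is a standard system view, and an empty result means there are none):

    -- Prepared transactions survive restarts and can keep xids/multixacts pinned
    SELECT gid, prepared, owner, database
    FROM pg_prepared_xacts
    ORDER BY prepared;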

0 Answers