
We have a web service where all user-facing read/write transactions are executed with isolation level SERIALIZABLE. However, we have been having problems where the transaction length of some connections goes through the roof, and in the majority of (all?) cases the stalled query has been ROLLBACK TRANSACTION. Connections stalling in this state are visible in the pg_stat_activity view.
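
For reference, this is roughly the query I use to spot those connections; the column set matches our PostgreSQL 9.5, and the 10-minute threshold is just an illustrative cutoff, not what we actually alert on:

```sql
-- Rough sketch: list connections whose current transaction has been open
-- for more than 10 minutes (illustrative threshold, PostgreSQL 9.5 columns).
SELECT pid,
       usename,
       state,
       waiting,
       now() - xact_start AS xact_age,
       query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '10 minutes'
ORDER BY xact_start;
```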

Unless I've misunderstood something, rolling back a SERIALIZABLE transaction should always succeed without any kind of locking and should require practically no resources on the PostgreSQL side.

What could cause a SERIALIZABLE transaction to hang on ROLLBACK TRANSACTION?

We're running connections through pgbouncer version 1.7 on localhost with `pool_mode=transaction`, in case that makes a difference. Running `select pg_terminate_backend(123)`, where 123 is the pid of the offending backend, does not immediately free the query from the client's point of view. The client only sees a result about 1 hour after issuing the query, and then I get the response:

57P01 terminating connection due to administrator command
08P01 server conn crashed?
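
For completeness, the `pg_terminate_backend` call above was run by hand against a single pid; a batched sketch for killing all long idle-in-transaction backends would look roughly like this (the 1-hour cutoff is only an example, not a setting we actually use):

```sql
-- Rough sketch: terminate every backend that has been idle in transaction
-- for more than an hour. The threshold is only an example.
SELECT pid, pg_terminate_backend(pid) AS terminated
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '1 hour';
```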
  • The `rollback transaction` does not always fail. Under high server load it sometimes takes ~500 ms even when everything goes as expected. – Mikko Rantalainen Apr 21 '20 at 08:58
  • You should set `log_min_duration_statement` to 0 and log all statements. Then you know which statement takes how long. That would reveal what is really going on. How did you arrive at the conclusion that `ROLLBACK` takes long? – Laurenz Albe Apr 21 '20 at 09:05
  • Please share more details on your system configuration (OS, software versions) and resource utilization (CPU, memory, I/O). If CPU usage is very high, pgbouncer could become a bottleneck. – filiprem Apr 21 '20 at 09:45
  • I have a custom client-side log implementation that logs a timestamp right before the statement and prints another log entry once the statement fails. I also found the reported error codes and messages via that custom log. – Mikko Rantalainen Apr 21 '20 at 11:18
  • @filiprem pgbouncer is running on 3 redundant systems with Xeon E3-1270 v6 CPUs and 16 GB RAM, and the database (streaming replication across 3 redundant systems) is running on other servers with similar CPUs, 64 GB RAM and Intel 905P storage. The connection that hung was connected to the master server, so this is not a hot-standby-related issue. The network between servers is only 1 Gbps but carries only ~200 Mbps of traffic. – Mikko Rantalainen Apr 21 '20 at 11:21
  • During high load the whole system is doing around 1000 transactions per second, with each transaction doing 20-2000 queries (yeah, variance is high). – Mikko Rantalainen Apr 21 '20 at 11:32
  • What does pg_stat_activity list for wait_event_type, wait_event, state ? – jjanes Apr 21 '20 at 12:50
  • @Mikko Can you reproduce the problem? Did you maybe try to remove pgbouncer from the setup and check if the problem persists? What was the CPU usage of the pgbouncer process? Is pgbouncer renice'd to avoid CPU starvation in case of overall CPU stress? (It's a known issue with pgbouncer and the round-robin CPU scheduler.) – filiprem Apr 21 '20 at 17:20
  • I cannot easily reproduce it but it seems to happen around once or twice a day during high load situations. It's not possible to take `pgbouncer` out of the setup because PostgreSQL would not be able to cope with all the parallel connections. I'll try to collect more data from `pg_stat_activity` the next time I see the problem. – Mikko Rantalainen Apr 21 '20 at 17:23
  • I need to check about renicing the pgbouncer process. During high load the loadavg may run in the hundreds on the pgbouncer systems, so CPU starvation is a real possibility. – Mikko Rantalainen Apr 21 '20 at 17:24
  • @jjanes We're running PostgreSQL 9.5 so it doesn't have all those columns. Some columns: `waiting=f`, `state=idle in transaction`, `xact_start` is nearly an hour ago, and `query_start` and `state_change` are ~100ms later so it seems that as far as PostgreSQL can see, nothing has happened for nearly an hour. – Mikko Rantalainen Apr 22 '20 at 06:54
  • The pgbouncer log contains the following after the problematic transaction has been killed automatically after 1 hour (I have to match by timestamps so this is a best guess): `2020-04-22 09:48:36.748 2555 WARNING C-0x7f681cd61380: /@127.0.0.1:41672 Pooler Error: client_idle_timeout`. So I guess something goes wrong between PHP and pgbouncer if pgbouncer cannot see the query sent by the PHP code. – Mikko Rantalainen Apr 22 '20 at 06:57
  • I also checked `pgbouncer` CPU usage; it seems to hover around 16-20% and does not increase if I `renice -n -19` it. I guess it's not CPU starvation after all. – Mikko Rantalainen Apr 22 '20 at 06:58
  • According to `pg_locks` the problematic transaction only had 28 locks, where all `page` and `tuple` locks were of type `SIReadLock`. All `relation` locks were `AccessShareLock` and the only `ExclusiveLock` was for `virtualxid` `114/273884` (see the query sketch after these comments). – Mikko Rantalainen Apr 22 '20 at 07:07
  • Any chance you can upgrade? I generally find that upgrading takes less time than working around the issues that come up when running a nearly-obsolete version. After all, new monitoring features are added for a reason. Although it does seem like the problem is between PHP and pgbouncer, not in the database itself. – jjanes Apr 22 '20 at 13:35
  • We will upgrade next summer when the load levels are expected to be low enough to allow a maintenance break. – Mikko Rantalainen Apr 22 '20 at 18:15
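
For reference, the `pg_locks` check mentioned in the comments above was roughly the following; `12345` is a placeholder for the stalled backend's pid taken from `pg_stat_activity`:

```sql
-- Rough sketch of the pg_locks check from the comments: list all locks
-- held by the stalled backend. 12345 is a placeholder pid.
SELECT locktype,
       mode,
       relation::regclass AS relation,
       virtualxid,
       granted
FROM pg_locks
WHERE pid = 12345
ORDER BY locktype, mode;
```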

1 Answer


It was probably some as-yet-unknown bug in the old PostgreSQL release. After upgrading the PostgreSQL server to 12.x, the error hasn't surfaced anymore. pgbouncer was also updated, so it might have been a bug there, too.

If anybody hits this problem in the future, upgrading both pgbouncer and PostgreSQL to the latest version should fix the issue.
