TL;DR: PG is taking a long time to shut down, and I am unable to find the root cause or reproduce the issue.
PG major version : PG 13.
Full issue: As part of an operation, our workflow performs a couple of checkpoints before sending SIGINT to shut down the DB. Once the shutdown with SIGINT (fast shutdown) is issued, I see another checkpoint happen on the PG instance in question. After that, the instance still does not shut down completely for ~4 hours. By this I mean that I don't see the engine log message "database system is shut down", which normally appears once the shutdown has succeeded. After the checkpoint completed successfully, all I see in the logs is the following, repeated in a loop for the 4 hours:
connection received: host=<> port=<>
the database system is shutting down(ProcessStartupPacket)
I believe these messages come from client apps trying to connect and being refused because a SIGINT was issued, and that they do not indicate the real reason for the stuck shutdown.
I am trying to understand what could have prevented PG from shutting down. Since this is a critical server, I cannot set log_min_messages to 'DEBUG5' and attempt another shutdown, only to see it meet a similar fate. On the other hand, I am not sure how I can reproduce this issue in my own environment.
As a long shot, I assumed that something going on with archiving could have caused this. But even after running pgbench with 10 connections for a significant amount of time, with inserts, updates, and long-running queries, I am not able to reproduce a slow shutdown.
Another aspect I was considering was to accumulate a lot of WAL files to see if archiving could indeed be the reason, but the pgbench experiment did not help much with that. Is there a way to accumulate a lot of WAL files? (I tried increasing checkpoint_timeout to the maximum possible value, but that did not help.)
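To test the archiving theory, I would want to measure the archive backlog just before a shutdown. A minimal sketch of what I have in mind, assuming WAL segments still waiting for the archiver are marked by *.ready files under pg_wal/archive_status in the data directory (the function name and the example path are my own placeholders, not from our workflow):

```python
from pathlib import Path


def count_pending_wal(data_dir: str) -> int:
    """Count WAL segments still waiting to be archived.

    Assumption: each pending segment has a '<segment>.ready' marker file
    in <data_dir>/pg_wal/archive_status.
    """
    status_dir = Path(data_dir) / "pg_wal" / "archive_status"
    if not status_dir.is_dir():
        # Wrong path, or nothing to inspect: report an empty backlog.
        return 0
    return sum(1 for entry in status_dir.iterdir() if entry.name.endswith(".ready"))
```

Calling count_pending_wal("/path/to/pgdata") (a hypothetical path) right before sending SIGINT would at least show whether a large backlog was present when the shutdown started.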
To summarize, below are the questions I am looking for help with:
- Beyond the logs, how can I find out why the shutdown took so long?
- Any suggestions on how I can reproduce this?
- How can I accumulate a lot of WAL files on my test server?