Slow replication recovery due to communication problems

Question

We have recently hit the same problem several times with PostgreSQL streaming replication on Google Compute Engine, and I would like to understand the cause and whether I can fix it in a smoother way.

From time to time we see communication problems in Google's internal network in the GCE datacenter, and they always trigger replication lag between our PG master and its replicas. All machines run Debian 8 and PostgreSQL 9.5.

When this happens, everything seems to be OK: there are no errors in the PG logs on the master or the replicas. Communication between the master and the replicas just seems to be incredibly slow or repeatedly failing, so new WAL files are transferred to the replicas with big delays and the replication lag keeps growing.
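To put a number on the growing lag, one option is to compare WAL positions from both sides. This is only a minimal sketch: it assumes you have already fetched the position on the master with `pg_current_xlog_location()` and on the replica with `pg_last_xlog_replay_location()` (the PostgreSQL 9.5 function names), and it converts the textual positions into a byte difference. The helper names `lsn_to_int` and `wal_lag_bytes` and the sample positions are made up for illustration.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual WAL position such as '16/B374D848' to a byte offset.

    The text form is two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_lag_bytes(master_lsn: str, replica_lsn: str) -> int:
    """Return how many bytes of WAL the replica is behind the master."""
    return lsn_to_int(master_lsn) - lsn_to_int(replica_lsn)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from a real system.
    master_position = "16/B374D848"
    replica_position = "16/B374D000"
    print(wal_lag_bytes(master_position, replica_position), "bytes behind")
```

Logging this value every few seconds on each replica would show whether the lag grows steadily or in bursts, which may help to correlate it with network events.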

Restarting replication from within PostgreSQL, or restarting PostgreSQL on the replica, does not really help: after several WAL files have been copied using scp in the recovery command, communication falls back to the previous incredibly slow state. Only a restart of the whole instance helps. When the whole VM is restarted, communication is back to normal, and recovery even from a lag of many hours is done in a few minutes. So the main cause of this behavior seems to be at the OS level. I tried to check the network traffic but did not find anything significant. I also do not see anything relevant in any OS log.

Could restarting some OS service help, so that I do not need to restart the whole VM? Thank you very much for any ideas.

From the problem description and the resolution step (VM restart), it does not appear to be a problem with the network. You may want to monitor resources (e.g. CPU, memory) to verify that they are not overutilized. It could also be a problem with the PostgreSQL configuration. Refer to this [thread](https://dba.stackexchange.com/questions/190762/high-delay-lag-between-master-slave-in-postgres-9-3). You can try posting on [DBA Stack Exchange](https://dba.stackexchange.com/) for better insight from other DB admins to troubleshoot this. - N Singh, Aug 17 '18 at 21:59
Hi, thanks, I will check the recommended thread. But the problem is that this situation happens on both of our replicas at the same time. Each replica is in a different zone, and a restart of the master does not fix it. Only a restart of the whole VM of the replica helps. And since we had other communication problems at the same time, on Google Container Engine, we presume it is caused by the network too. - JosMac, Aug 20 '18 at 07:27
To verify the network issue, try the [mtr tool](https://www.linode.com/docs/networking/diagnostics/diagnosing-network-issues-with-mtr/) and [iperf](https://www.thegeekdiary.com/how-to-use-iperf-to-test-network-performance-in-linux/); this will help to narrow down the troubleshooting. If the problem appears to be in the Google network, feel free to submit an [issue report] providing all the related details and network debugging output. - N Singh, Aug 20 '18 at 18:42
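Alongside mtr and iperf, a small probe left running on the replica could record when the connection to the master starts to degrade. The following is only a rough sketch of that idea, not a replacement for those tools: it times plain TCP connections with Python's standard `socket` module. The function names, the host name `pg-master` and port 5432 are assumptions for illustration.

```python
import socket
import time


def connect_time_ms(host: str, port: int, timeout: float = 5.0) -> float:
    """Open one TCP connection and return the time it took in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # The connection is closed again immediately.
    return (time.perf_counter() - start) * 1000.0


def probe(host: str, port: int, samples: int = 5, pause: float = 1.0) -> list:
    """Take several samples; a failed or timed-out attempt is stored as None."""
    results = []
    for i in range(samples):
        try:
            results.append(connect_time_ms(host, port))
        except OSError:
            results.append(None)
        if i < samples - 1:
            time.sleep(pause)
    return results


if __name__ == "__main__":
    # Hypothetical master host name and the default PostgreSQL port.
    for value in probe("pg-master", 5432):
        print("failed" if value is None else f"{value:.1f} ms")
```

Connect time only reflects connection setup latency, not throughput, so a slow WAL transfer with normal connect times would still point towards iperf for a bandwidth test.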

0 Answers