I'm running master & replica on PG 13.3. I decided to use delayed replication (30 minutes configured in recovery_min_apply_delay parameter). On top of that, WAL archiving is configured and working well.

When load on the master is very high for a long time, replication falls behind until max_slot_wal_keep_size is exceeded (see my other, related question: Replication lag - exceeding max_slot_wal_keep_size, WAL segments not removed). Once it falls too far behind, the slot is "lost" and the replica falls back to restoring WAL from the archive. So far so good. The problem is that it never tries streaming replication again. Restarting the replica does not help. I have found two ways to restore replication:

  1. Restarts & config edits
  • Remove the delay config from the replica
  • Restart postgres. It then restores all the WAL from the archive and, once there's nothing left, starts streaming replication again, but without any delay. Then I edit the config again to reintroduce the delay, and it sometimes works, sometimes doesn't. I think it depends on the load.
  2. Removing a WAL segment from archive
  • Look at the currently restored WAL segment in the postgresql log and temporarily move the following one out of the WAL archive. When PG tries to restore it, it fails and falls back to streaming replication
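
For the second workaround, the name of "the following" segment can be derived from the one currently being restored. A minimal Python sketch, assuming the standard 24-hex-digit WAL file naming (timeline, log id, segment number); the function name is made up for illustration, and the segment size must match the cluster's actual setting:

```python
def next_wal_segment(name: str, segment_size_mb: int = 16) -> str:
    """Return the file name of the WAL segment that follows `name`.

    Assumes the usual layout: 8 hex digits of timeline, 8 of log id,
    8 of segment number within that log id.
    """
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment file name")

    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg_id = int(name[16:24], 16)

    # How many segments share one log id: 4 GiB divided by the segment size
    # (256 for 16MB segments, 16 for 256MB segments).
    segs_per_log = 0x100000000 // (segment_size_mb * 1024 * 1024)

    # Flatten to a single segment counter, advance by one, split again.
    seg_no = log_id * segs_per_log + seg_id + 1
    return f"{timeline:08X}{seg_no // segs_per_log:08X}{seg_no % segs_per_log:08X}"


if __name__ == "__main__":
    # Hypothetical segment name with the default 16MB segment size.
    print(next_wal_segment("0000000100000000000000FF"))
    # Same idea with 256MB segments, as in the setup described here.
    print(next_wal_segment("00000001000000020000000F", segment_size_mb=256))
```

This only computes a file name; it does not touch the archive, and it does not handle timeline switches.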

This doesn't seem like the right way to do it, does it?

Thanks,

-- Marcin

Marcin Krupowicz

1 Answer

As far as I can see, this is a non-problem.

If you want replication delayed by 30 minutes, and you archive more than one 16MB WAL segment per half hour, there is no need to replicate. The information can just as well be read from the archive. If the latest entry in the latest archived WAL segment happens to be older than recovery_min_apply_delay, the standby will contact the primary and replicate.
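
The condition in the paragraph above is simple arithmetic. A rough Python sketch, assuming a steady WAL generation rate (the function name and the example rates are made up for illustration, not measured values):

```python
def archive_keeps_up(wal_mb_per_hour: float,
                     delay_minutes: float = 30.0,
                     segment_mb: int = 16) -> bool:
    """True if more than one full WAL segment is produced per delay window.

    When this holds, a completed segment reaches the archive within every
    delay window, so a delayed standby can be fed from the archive alone.
    """
    wal_mb_per_window = wal_mb_per_hour * delay_minutes / 60.0
    return wal_mb_per_window > segment_mb


if __name__ == "__main__":
    # 100 MB/hour with 16MB segments and a 30 minute delay.
    print(archive_keeps_up(100))
    # The same delay with 256MB segments needs a much higher WAL rate.
    print(archive_keeps_up(400, segment_mb=256))
```

Real workloads are bursty, so this is only a back-of-the-envelope check.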

If you insist on replication rather than archive recovery, remove restore_command and max_slot_wal_keep_size from the configuration. But I don't see the point.

If you are concerned about losing the active WAL segment in case of a catastrophe on the primary, use pg_receivewal rather than archive_command to populate the WAL archive.

Laurenz Albe
  • As far as I can tell, the main difference in my case is this: restore_command is only called when a WAL segment is needed for applying, whereas replication streams WAL as it is generated and applies it later. restore_command can therefore cause a larger data loss (up to 16MB, although in my case it is 256MB). I use max_slot_wal_keep_size because under no circumstances do I want the slot to kill the master. – Marcin Krupowicz Jul 13 '21 at 11:18
  • Use `pg_receivewal`, as indicated in my extended answer. – Laurenz Albe Jul 13 '21 at 11:31
  • Yes, I could do that, although it is yet another process to worry about. I find it surprising that PG would not try to re-establish replication as the preferred way of keeping the replica running. Restoring from the archive is worse in my case for the reasons already described, and it also puts more strain on the archive storage (NFS). I wanted it to be used only when replication fails, and for no longer than necessary. – Marcin Krupowicz Jul 13 '21 at 12:17
  • @MarcinKrupowicz Did you find a solution to re-establish WAL transfer using streaming replication only? Or is the archive command the best option? – sh4rkyy May 30 '23 at 14:53