
Requirement:

Avoid the "terminating connection due to conflict with recovery" error while also keeping replication lag acceptable.

Google Cloud PostgreSQL 9.6 with replication turned on (streaming replication), and PGPool-II set to do load balancing only, with the following properties on the slave:

work_mem    3276800
commit_delay    100
max_wal_size    940
max_standby_archive_delay   -1
max_standby_streaming_delay -1
hot_standby_feedback    on

Machine config:

vCPUs:8, Memory: 30 GB, SSD storage: 76 GB

Workload:

The master is fully loaded with writes and reads, and the slave is also fully loaded with lots of reads. The longest queries run for around 8-10 seconds.

What we tried before:

  • Set max_standby_archive_delay and max_standby_streaming_delay to 900000 (900 seconds); however, we were seeing a lot of conflict errors.

  • Set max_standby_archive_delay and max_standby_streaming_delay to -1. This made the conflict errors go away, but the lag increased a lot (to somewhere around 23 minutes).

  • Set max_standby_archive_delay and max_standby_streaming_delay to -1 and hot_standby_feedback to on. This also made the conflict errors go away, but we are still seeing replication lag (around 500 seconds).

Query used for lag:

SELECT
  pg_last_xlog_receive_location() receive,
  pg_last_xlog_replay_location() replay,
  (
   extract(epoch FROM now()) -
   extract(epoch FROM pg_last_xact_replay_timestamp())
  )::int lag;
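
For completeness, here is a variant of the same query (using the 9.6 function names) that also reports the byte gap between received and replayed WAL; it helps distinguish a standby that is behind on receiving WAL from one that is behind on applying it:

SELECT
  pg_last_xlog_receive_location()                       AS receive,
  pg_last_xlog_replay_location()                        AS replay,
  -- bytes received from the master but not yet replayed on the slave
  pg_xlog_location_diff(pg_last_xlog_receive_location(),
                        pg_last_xlog_replay_location()) AS replay_lag_bytes,
  -- seconds since the commit timestamp of the last replayed transaction
  (
   extract(epoch FROM now()) -
   extract(epoch FROM pg_last_xact_replay_timestamp())
  )::int                                                AS lag_seconds;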

Graph of lag measured every 1 sec over a period of 9 hours:

[graph of replication lag omitted]

Questions:

  1. Given our use case (the slave being actively used for read queries), how do we make sure we have no conflict errors and a reasonable lag (around a few seconds)?
  2. What does the lag mean? Does it mean that only one of the tables is behind the master, or that all other WAL records are also pending to be applied on the slave?
  3. If 1. is not achievable using config properties, how do we solve it in code? (This is the least desirable option, since the code base is vast and would require lots of changes.)

Thanks!

Khushbu

1 Answer


You cannot totally avoid conflicts — every statement like TRUNCATE or ALTER TABLE that requires an ACCESS EXCLUSIVE lock will lead to a replication conflict.

But you can avoid replication conflicts caused by VACUUM:

  • Set hot_standby_feedback = on to keep PostgreSQL from removing tuples still needed on the standby.

  • Set old_snapshot_threshold to a (possibly high) value other than the default to avoid vacuum truncation.

    This truncation requires an ACCESS EXCLUSIVE lock that can also lead to conflicts; a configuration sketch follows this list.
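
A rough configuration sketch (the threshold value is only an example, and old_snapshot_threshold takes effect on the primary, where VACUUM runs):

# on the standby
hot_standby_feedback = on        # report the oldest open transaction back to the primary

# on the primary
old_snapshot_threshold = 1h      # any value other than the default -1 avoids vacuum truncation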

For the remaining conflicts, you have a choice between delayed application and query cancellation. Or you change the workload to avoid ACCESS EXCLUSIVE locks.

To find out what is blocking you, you'll have to use pg_xlogdump on the WAL files and search for ACCESS EXCLUSIVE locks. This will allow you to figure out which object is locked. To find out what kind of operation is performed, check the WAL entries immediately before (VACUUM?) or immediately afterwards (DDL?).
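
For example, assuming file system access to the WAL directory (the path and segment name below are placeholders):

# show only Standby resource manager records, which include the
# AccessExclusiveLock entries taken on the primary
pg_xlogdump -p /var/lib/postgresql/9.6/main/pg_xlog -r Standby \
    000000010000000000000010 | grep -i lock

The locked relation is reported by OID; SELECT <oid>::regclass on the primary resolves it to a table name.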

Laurenz Albe
  • Thanks @laurenz-albe! The conflicts we are getting are mostly because of heavy reads of rows on the replica that are updated on the master, not because of `VACUUM`. Certainly not because of `TRUNCATE` or other operations that take `ACCESS EXCLUSIVE` locks, since we are not altering the tables at all. Also, autovacuum is on, and we are not triggering `VACUUM` in our workload. So how do we avoid conflicts caused by non-`ACCESS EXCLUSIVE` transactions? – Khushbu Aug 22 '19 at 03:21
  • There are no such conflicts. Heavy read workloads won't delay the application of replication changes. Maybe your lag is caused by network bandwidth problems. – Laurenz Albe Aug 22 '19 at 03:58
  • Network bandwidth is not an issue, because the latency is not that high. `Heavy read workloads won't delay the application of replication changes.` But won't a delay be introduced when heavy writes happen to a table on the master that is read heavily on the replica? To explain more clearly, we have a table that is read from and written to by lots of threads in parallel. We also have PgPool, which sends all `writes` to the master and `reads` to either the master or the slave. – Khushbu Aug 22 '19 at 06:10
  • If the machine itself is totally overloaded, applying changes may take some time. But I don't believe that, because the lag spikes build up linearly and drop so suddenly. That indicates that something is blocking replay. Check again. – Laurenz Albe Aug 22 '19 at 06:16
  • The CPU load was below 60% all the time. Can you please let us know how we can check what is blocking the replay? Thanks a lot for the help! – Khushbu Aug 22 '19 at 06:35
  • I have added some debugging tips to the answer. – Laurenz Albe Aug 22 '19 at 06:46
  • Page 100 of this book describes exactly this problem: https://manualzz.com/doc/31920832/postgresql-replication – Jasper Kuperus Jun 03 '20 at 07:47