Requirement:
Avoid terminating connection due to conflict with recovery
error and also have acceptable replication lag
.
Google Cloud PostgreSQL 9.6, Replication turned on (uses Streaming replication), PGPool-II set to only do load balancing and with following properties on slave:
work_mem 3276800
commit_delay 100
max_wal_size 940
max_standby_archive_delay -1
max_standby_streaming_delay -1
hot_standby_feedback on
Machine config:
vCPUs:8, Memory: 30 GB, SSD storage: 76 GB
Workload:
Master fully loaded with writes
and reads
, and slave also fully loaded with lots of reads
.
The max length of queries might be around 8
-10
secs.
What we tried before:
Set
max_standby_archive_delay
andmax_standby_streaming_delay
to900000
(900 secs), however we were seeing a lot ofconflict
errors.Set
max_standby_archive_delay
andmax_standby_streaming_delay
to-1
, this made the conflict errors go away, however the lag increased a lot (somewhere around23mins
)Set
max_standby_archive_delay
andmax_standby_streaming_delay
to-1
andhot_standby_feedback
toon
. This also made the conflict errors go away, however we are still seeing replication lags (around500 secs
)
Query used for lag:
SELECT
pg_last_xlog_receive_location() receive,
pg_last_xlog_replay_location() replay,
(
extract(epoch FROM now()) -
extract(epoch FROM pg_last_xact_replay_timestamp())
)::int lag;
Graph of lag measured every 1 sec over a period of 9 hours
:
Questions:
- Given our use-case (Slave being actively used for read queries, how do we make sure we have no conflict errors and a reasonable lag (around few secs)
- What does the lag mean? Does it mean only one of the table is behind Master? Or does it mean all other WALs are also pending to be applied on Slave.
- If 1. is not achievable using config properties, how do we solve it in code (This is the least desirable since the code base is vast and will require lots of changes)
Thanks!