
I set up a distributed PostgreSQL database system and configured streaming replication with wal_level = hot_standby.

There is one central master database with multiple replicas (currently 15, spread across the world) that are used as read-only instances. There is no need for failover; we just want the data synced to remote locations where we can read it.

master:

wal_level = hot_standby
max_wal_senders = 20
checkpoint_segments = 8    
wal_keep_segments = 8

clients:

wal_level = hot_standby
max_wal_senders = 3
checkpoint_segments = 8    
wal_keep_segments = 8 
hot_standby = on

/var/lib/postgresql/9.4/recovery.conf on the client side:

standby_mode = 'on'
primary_conninfo = 'host=<IP of master> port=5432 user=replicator password=xxxx sslmode=require'
trigger_file = '/tmp/postgresql.trigger'

The replication starts and all seems fine for some days. After a while, though, the master no longer accepts replication connections:

client:

2017-05-04 01:16:51 UTC [9608-1] FATAL:  could not connect to the primary server: FATAL:  sorry, too many clients already
2017-05-04 01:16:57 UTC [10807-1] FATAL:  could not connect to the primary server: FATAL:  sorry, too many clients already
2017-05-04 01:17:02 UTC [12022-1] FATAL:  could not connect to the primary server: FATAL:  sorry, too many clients already
2017-05-04 01:17:06 UTC [13217-1] FATAL:  could not connect to the primary server: FATAL:  remaining connection slots are reserved for non-replication superuser connections
...

master:

And then the log fills up with messages like the ones below, and it never recovers:

2017-05-04 08:44:14 UTC [24850-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:19 UTC [25958-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:24 UTC [27063-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:29 UTC [28144-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:34 UTC [29227-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:39 UTC [30316-1] replicator@[unknown] ERROR:  requested WAL segment 000000010000003500000014 has already been removed
...

client:

2017-04-30 11:26:22 UTC [28474-1] LOG:  started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:22 UTC [28474-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-04-30 11:26:26 UTC [29328-1] LOG:  started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:26 UTC [29328-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000003500000014 has already been removed
2017-04-30 11:26:31 UTC [30394-1] LOG:  started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:31 UTC [30394-2] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000003500000014 has already been removed
...

So my question:

  • Is there a better way to just sync changes to remote read-only instances, or do I have errors in the configs that make my replication break after some time?
roegi
  • if you don't use archive_command (from your recovery.conf) to save WALs, and you don't have `000000010000003500000014` any more, indeed you have to rebuild replication – Vao Tsun May 04 '17 at 10:31
  • please decrease the number of questions to one essential one, correlating to your log samples or config... – Vao Tsun May 04 '17 at 10:32
  • what do you mean with "if you don't use archive_command (from your recovery.conf) to save WALs, and you don't have 000000010000003500000014"? – roegi May 04 '17 at 10:47
  • I mean the slave could not connect to the master for long enough that the master erased the WAL the slave needs to go on with replication – Vao Tsun May 04 '17 at 10:48
  • @VaoTsun: changed now to a single question. – roegi May 04 '17 at 10:50
  • it could not connect due to "could not connect to the primary server: FATAL: sorry, too many clients already" - so is this a wrong setup I am running here? or just a too low connection limit for 15 replicas? – roegi May 04 '17 at 10:51

1 Answer


https://www.postgresql.org/docs/current/static/runtime-config-replication.html:

max_wal_senders (integer)

Specifies the maximum number of concurrent connections from standby servers or streaming base backup clients (i.e., the maximum number of simultaneously running WAL sender processes). The default is zero, meaning replication is disabled. WAL sender processes count towards the total number of connections, so the parameter cannot be set higher than max_connections. Abrupt streaming client disconnection might cause an orphaned connection slot until a timeout is reached, so this parameter should be set slightly higher than the maximum number of expected clients so disconnected clients can immediately reconnect.

(emphasis mine). Either application connections or orphaned connections led to your

FATAL: sorry, too many clients already
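
With 15 standbys spread around the world, that suggests giving the master more headroom than the current settings. A sketch of the relevant postgresql.conf lines on the master (the exact numbers are illustrative assumptions, not values from the question):

max_connections = 200      # WAL senders count against this limit
max_wal_senders = 25       # 15 standbys plus headroom so disconnected standbys can reconnect immediately
wal_keep_segments = 64     # retain more WAL so a briefly disconnected standby can resume streaming

Raising wal_keep_segments also reduces the chance of the "requested WAL segment ... has already been removed" error when a standby reconnects after an outage.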

You might want to use a connection pooler for the applications, such as PgBouncer, to cap the number of client connections before the server limit is hit.
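
A minimal PgBouncer sketch (the database name, file path, and pool sizes are made up for illustration):

[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500     ; many application clients may connect here...
default_pool_size = 20    ; ...but they share only 20 server connections per db/user

Applications then connect to port 6432 instead of 5432, which keeps connection slots on the master free for WAL senders and superusers.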

Answering your question: if you have archive_command set to actually copy WAL segments somewhere, set restore_command in recovery.conf on the slave so it can pick them up. That will allow the slave to catch up from the moment it lost the stream. Otherwise you have to rebuild it (e.g. with a fresh pg_basebackup).
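
As a sketch of that setup (the archive directory /mnt/wal_archive is an assumption; it has to live on storage that the master writes to and every standby can read, e.g. an NFS mount or an rsync target):

master, postgresql.conf:

archive_mode = on
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

standby, recovery.conf:

standby_mode = 'on'
primary_conninfo = 'host=<IP of master> port=5432 user=replicator password=xxxx sslmode=require'
restore_command = 'cp /mnt/wal_archive/%f %p'

With restore_command in place, a standby that falls behind replays archived segments instead of failing with "requested WAL segment ... has already been removed", and switches back to streaming once it has caught up. Note that changing archive_mode requires a restart of the master.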

Vao Tsun
  • Thanks a lot! I will try increasing max_wal_senders to a higher number. Furthermore, I would like to get archive_command and restore_command working - do you have an example of what needs to be done? – roegi May 04 '17 at 13:46