I setup a distributed postgres database system and configured replication using hot_standby wal_level.
There is one central master database in place with multiple replicas (15 currently across the world) that are used as read only instances - so no need a failover - we just want to have the data synced to remote locations where we can read them.
master:
wal_level = hot_standby
max_wal_senders = 20
checkpoint_segments = 8
wal_keep_segments = 8
clients:
wal_level = hot_standby
max_wal_senders = 3
checkpoint_segments = 8
wal_keep_segments = 8
hot_standby = on
/var/lib/postgresql/9.4/recovery.conf on Client side:
standby_mode = 'on'
primary_conninfo = 'host=<IP of master> port=5432 user=replicator password=xxxx sslmode=require'
trigger_file = '/tmp/postgresql.trigger'
the replication starts - all seems fine for some days. after some days it seems there are no more connections accepted at the master for replication...
client:
2017-05-04 01:16:51 UTC [9608-1] FATAL: could not connect to the primary server: FATAL: sorry, too many clients already
2017-05-04 01:16:57 UTC [10807-1] FATAL: could not connect to the primary server: FATAL: sorry, too many clients already
2017-05-04 01:17:02 UTC [12022-1] FATAL: could not connect to the primary server: FATAL: sorry, too many clients already
2017-05-04 01:17:06 UTC [13217-1] FATAL: could not connect to the primary server: FATAL: remaining connection slots are reserved for non-replication superuser connections
...
master:
and then the loag is full of the messages like below - and it never recovers...
2017-05-04 08:44:14 UTC [24850-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:19 UTC [25958-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:24 UTC [27063-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:29 UTC [28144-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:34 UTC [29227-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-05-04 08:44:39 UTC [30316-1] replicator@[unknown] ERROR: requested WAL segment 000000010000003500000014 has already been removed
...
client:
2017-04-30 11:26:22 UTC [28474-1] LOG: started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:22 UTC [28474-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-04-30 11:26:26 UTC [29328-1] LOG: started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:26 UTC [29328-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000003500000014 has already been removed
2017-04-30 11:26:31 UTC [30394-1] LOG: started streaming WAL from primary at 35/14000000 on timeline 1
2017-04-30 11:26:31 UTC [30394-2] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000003500000014 has already been removed
...
so my question:
- is there an better way just to sync changes to remote read-only instances or do i just have errors in the configs that make my replication break after some time?