How to manage PostgreSQL 9.6 missing WAL segment from master server?

Question

I have a PITR configuration with PostgreSQL 9.6, with a master server, an intermediate server, and two slave servers (hot standby, but manually switched), such as this:

Master
  |
 I1
 / \
S1 S2

A failure writing to disk caused the master server to crash. A partner in the development team corrected the error and then restarted the master database (instead of promoting the intermediate server, which is the established procedure). Because of this, there is a corrupt partial WAL and a whole WAL missing from the sequence.

Now, I have no transactions missing, but slave 1 as well as the intermediate server complain about the missing WAL (ERROR: requested WAL segment [...] has already been removed) even though they are still updating; S2 complains as well (same as above, but preceded by FATAL: could not receive data from WAL stream:), and it is not updating.

Since the transactions happening when the master server went down have -already- been executed, I do not care about the missing WAL segments. So the proper questions are:

1) How do I get rid of the nagging about the missing WAL? I already tried `pg_resetxlog -l (next valid WAL file) -f` (after which the server no longer complains, but does not update either) and `pg_basebackup` which, not surprisingly on second thought, returns to the situation described above.

2) Why is one of the slaves updating (unexpected) while the other one is not (expected)? I thought first that perhaps the updating slave was directly connecting to the master, but it is not; I have checked the configuration files and they are identical in both slaves.

Thanks for your attention

Don't ever use `pg_resetxlog` unless you have a corrupted database and want to salvage some data. Rebuild your standby servers with a new `pg_basebackup` and use `restore_command`, `wal_keep_segments` or replication slots to avoid the problem in the future. — Laurenz Albe, Mar 23 '20 at 15:45
I have already tried a `pg_basebackup`. I thought it would reset to the new checkpoint, right? Well, it does not. What I did was: 1) Stop the slave server and rename the data directory. 2) Run `pg_basebackup -h master.server -U replica.user -D /var/lib/pgsql/9.6/data --xlog`. 3) Alter `postgresql.conf` and `recovery.conf` as necessary, and start the server. The server starts OK and is updating from the master OK, but the message about the missing WAL is still there, and only one of the other slaves is updating. — Aaron Rivacoba, Mar 23 '20 at 17:40
To keep the standby from falling behind, use either 1) `restore_command` on the standby or 2) `wal_keep_segments` on the primary or 3) replication slots. — Laurenz Albe, Mar 24 '20 at 07:23
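
For readers unfamiliar with the three options named in the comment above, here is a rough sketch of what each could look like on 9.6. It is only an illustration under stated assumptions, not a tested setup for this cluster: the archive path `/mnt/wal_archive`, the slot name `s2_slot`, and the numeric values are all hypothetical, and option 1 presumes that WAL archiving (`archive_mode` and `archive_command`) is already configured on the upstream side.

```conf
# --- Option 1: recovery.conf on the standby ---
# Fetch segments from a WAL archive when streaming cannot supply them.
# /mnt/wal_archive is a hypothetical path; the upstream server must be archiving there.
restore_command = 'cp /mnt/wal_archive/%f %p'

# --- Option 2: postgresql.conf on the server the standby streams from ---
# Keep at least this many past WAL segments in pg_xlog (illustrative value;
# each segment is 16 MB by default, so size it against available disk).
wal_keep_segments = 256

# --- Option 3: replication slot ---
# postgresql.conf on the server the standby streams from (needs a restart;
# the 9.6 default is 0, which disables slots):
max_replication_slots = 4
# Run once on that same server, in psql (hypothetical slot name):
#   SELECT pg_create_physical_replication_slot('s2_slot');
# recovery.conf on the standby:
primary_slot_name = 's2_slot'
```

Any one of the three is normally enough. Note that a slot retains WAL for as long as its standby is disconnected, so an abandoned slot can fill the upstream server's disk, whereas `wal_keep_segments` caps the retained amount but offers no guarantee once a standby falls further behind than that.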
