I have a primary server that archives (and gzips) WAL files into /wal/archive/. At the moment, I'm attempting to set up a hot standby with streaming replication from a base backup.

When starting up the standby, I noticed it was producing errors such as:

    could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000142400000014 has already been removed

This makes sense: about a day has passed since the base backup was taken, so the primary has already archived those WAL files and removed them. I added a `restore_command` to the standby server's recovery.conf that uses scp to fetch the file from the primary and decompresses it:

    restore_command = '(set -o pipefail; scp primary_ip:/wal/archive/%f.gz /dev/stdout | pigz -d > "%p")'
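For context, PostgreSQL replaces `%f` with the name of the requested WAL file and `%p` with the destination path before running the command. The following is a minimal sketch of that substitution for manual testing. `expand_restore_command` is a hypothetical helper, not part of PostgreSQL, and it ignores other placeholders such as `%r` and `%%`; the destination path in the example is an assumption for illustration.

```shell
# Hypothetical helper: prints the command line that results from replacing
# %f and %p in a restore_command template. It does not execute anything.
# Limitation: the substituted values must not contain '|' or '&', because
# they are passed to sed as replacement text.
expand_restore_command() {
    # $1 = template, $2 = WAL file name (%f), $3 = destination path (%p)
    printf '%s\n' "$1" | sed -e "s|%f|$2|g" -e "s|%p|$3|g"
}

# Example using the segment named in the error message. The destination
# path pg_xlog/RECOVERYXLOG is an assumption used for illustration.
expand_restore_command \
    'scp primary_ip:/wal/archive/%f.gz /dev/stdout | pigz -d > "%p"' \
    000000010000142400000014 \
    pg_xlog/RECOVERYXLOG
```

The printed line can then be run by hand as the postgres user from the data directory, which approximates the conditions the standby would run it under.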

Strangely, the same errors kept appearing. Note that I have verified the command works when I run it by hand with a real file name substituted in. To see whether the command was being run at all, I added an echo:

    restore_command = '(set -o pipefail; echo "%f" >> /data/test.txt && scp primary_ip:/wal/archive/%f.gz /dev/stdout | pigz -d > "%p")'

The command is clearly not being run, because /data/test.txt is never created, even though the postgres user has permission to write to /data/. Is there something that needs to be specified on the standby to instruct it to use `restore_command` when the primary has already archived a WAL file?
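The echo-based check could also be moved into a small wrapper script, which makes each invocation and its outcome visible in one log. This is only a sketch under assumptions: the script name `restore_wal.sh`, the log path, and the function names are hypothetical, and it fetches to a temporary file instead of piping so that it does not depend on `pipefail`. It returns a nonzero status on any failure, which `restore_command` requires.

```shell
#!/bin/sh
# Hypothetical restore_wal.sh; the script name, log path, and defaults are
# assumptions for illustration. It would be referenced from recovery.conf as:
#   restore_command = '/path/to/restore_wal.sh %f "%p"'

ARCHIVE_HOST="${ARCHIVE_HOST:-primary_ip}"
ARCHIVE_DIR="${ARCHIVE_DIR:-/wal/archive}"
LOG_FILE="${LOG_FILE:-/data/restore_wal.log}"

# Writes the compressed segment named $1 to stdout.
fetch_segment() {
    scp "${ARCHIVE_HOST}:${ARCHIVE_DIR}/$1.gz" /dev/stdout
}

# Decompresses gzip data from stdin to stdout.
decompress_segment() {
    pigz -d
}

# $1 = WAL file name (%f), $2 = destination path (%p)
restore_wal() {
    rw_tmp="$2.gz.tmp"
    echo "$(date) requested $1" >> "$LOG_FILE"
    # Fetch to a temporary file first so a failed transfer is detected
    # before any decompressed output is written.
    if fetch_segment "$1" > "$rw_tmp" && decompress_segment < "$rw_tmp" > "$2"; then
        rm -f "$rw_tmp"
        echo "$(date) restored $1" >> "$LOG_FILE"
        return 0
    fi
    # Remove partial output and report failure with a nonzero status.
    rm -f "$rw_tmp" "$2"
    echo "$(date) FAILED $1" >> "$LOG_FILE"
    return 1
}

# Run only when called with both arguments, as restore_command would.
if [ "$#" -eq 2 ]; then
    restore_wal "$1" "$2"
    exit $?
fi
```

If the log file stays empty after the standby starts, the command is not being invoked at all, which separates that case from a transfer or decompression failure.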

My recovery.conf file has been set up according to section 25.2.4 of the docs.

Leah Sapan
  • Did you remove the files from `pg_xlog` on your replication server? – sysfiend Apr 25 '16 at 08:36
  • When I restored it from the backup the `pg_xlog` only had one file in it called `archive_status`. I just tried deleting that file and the same errors still occur. It's worth noting I've also tried removing `primary_conninfo` from recovery.conf to see if I can force it to use `restore_command` but it still doesn't call it. The only settings set in recovery.conf are `restore_command`, `standby_mode` and `trigger_file`. – Leah Sapan Apr 25 '16 at 14:35
  • When I start it up, it says `entering standby mode`, `incomplete startup packet`, then `the database system is starting up` 10-15 times, and then `incomplete startup packet` again. – Leah Sapan Apr 25 '16 at 14:36
  • You only need to remove the files from `pg_xlog`, not the folder. – sysfiend Apr 25 '16 at 14:45
  • Correct, I didn't remove the folder. I tried removing the `archive_status` file inside as a last-ditch effort when it wasn't working. – Leah Sapan Apr 25 '16 at 14:46
  • Same, don't remove anything there. Basically, you need to do `rm -f pg_xlog/*`. Anyway, I'd check the config file once more against the manual step by step, check for network connectivity issues, and then redo the setup from scratch. – sysfiend Apr 25 '16 at 14:49
  • @Alex there's no reason the primary would need `restore_command` set somewhere, correct? In case it could somehow hit the archive and stream it over to the secondary. – Leah Sapan Apr 25 '16 at 16:48
  • Nope, `restore_command` is only needed on the "slave"; you only need to edit `postgresql.conf` on the primary server. – sysfiend Apr 26 '16 at 07:44

0 Answers