PostgreSQL streaming replication with restore_command for archived WAL files

Question

I have a primary server that archives (and gzips) WAL files into /wal/archive/. At the moment, I'm attempting to set up a hot standby with streaming replication from a base backup.

When starting up the standby, I noticed it was producing errors such as:

could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000142400000014 has already been removed

This makes sense: about a day has passed since the base backup was taken, and those WAL files have already been archived. I provided a restore_command in the standby server's recovery.conf to scp the file over from the primary and decompress it:

restore_command = '(set -o pipefail; scp primary_ip:/wal/archive/%f.gz /dev/stdout | pigz -d > "%p")'

Strangely, the same errors kept appearing. I have confirmed that the above command works when I run it by hand with a real file name. To see whether the command was actually being run, I added an echo:

restore_command = '(set -o pipefail; echo "%f" >> /data/test.txt && scp primary_ip:/wal/archive/%f.gz /dev/stdout | pigz -d > "%p")'

I can clearly see it is not running the command as /data/test.txt is not being created. The postgres user has permission to write to /data/. Is there something that needs to be specified on the standby to instruct it to use restore_command when the primary has already archived a WAL file?
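For debugging, it may help to move the logic out of the one-liner into a small wrapper so that every invocation is logged before anything else happens. The following is only a sketch under assumptions, not a known fix for the problem above: the function name `restore_wal`, the variables `RESTORE_LOG`, `ARCHIVE_SRC` and `FETCH_CMD`, and the log path are all hypothetical, and plain `gzip` stands in for `pigz`. It returns a nonzero status when the segment cannot be fetched, which is how PostgreSQL learns that a file is not in the archive.

```shell
#!/bin/sh
# Hypothetical wrapper for restore_command. If saved as a script whose last
# line is: restore_wal "$1" "$2"
# it could be referenced (path is an assumption) as:
#   restore_command = '/usr/local/bin/restore_wal.sh "%f" "%p"'

restore_wal() {
    wal_file=$1                                   # %f: WAL file name requested
    target=$2                                     # %p: path to write the file to
    log=${RESTORE_LOG:-/data/restore_command.log} # assumed log location
    src=${ARCHIVE_SRC:-primary_ip:/wal/archive}   # archive location from the question
    fetch=${FETCH_CMD:-scp -q}                    # copy command, overridable for testing
    tmp="$target.partial.$$"

    # Log first, so an empty log means the command was never invoked at all.
    echo "$(date '+%Y-%m-%d %H:%M:%S') restore requested: $wal_file -> $target" >> "$log"

    # Fetch the compressed segment, decompress to a temporary file, then move it
    # into place so a half-written file is never left at the target path.
    if $fetch "$src/$wal_file.gz" "$tmp.gz" 2>> "$log" &&
       gzip -dc "$tmp.gz" > "$tmp" 2>> "$log"; then
        rm -f "$tmp.gz"
        mv "$tmp" "$target"
        return 0
    fi

    echo "restore failed: $wal_file" >> "$log"
    rm -f "$tmp" "$tmp.gz"
    return 1   # nonzero tells PostgreSQL the file is not available
}
```

If the log file stays empty after the standby starts, the server never reached the point of calling restore_command, which narrows the problem to configuration rather than to scp or decompression.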

My recovery.conf file has been set up according to section 25.2.4 of the docs.

Did you remove the files from `pg_xlog` on your replication server? — sysfiend, Apr 25 '16 at 08:36
When I restored it from the backup the `pg_xlog` only had one file in it called `archive_status`. I just tried deleting that file and the same errors still occur. It's worth noting I've also tried removing `primary_conninfo` from recovery.conf to see if I can force it to use `restore_command` but it still doesn't call it. The only settings set in recovery.conf are `restore_command`, `standby_mode` and `trigger_file`. — Leah Sapan, Apr 25 '16 at 14:35
When I start it up, it says `entering standby mode`, `incomplete startup packet`, then `the database system is starting up` 10-15 times, and then `incomplete startup packet` again. — Leah Sapan, Apr 25 '16 at 14:36
You only need to remove the files from `pg_xlog`, not the folder. — sysfiend, Apr 25 '16 at 14:45
Correct, I didn't remove the folder. I tried removing the `archive_status` file inside as a last ditch effort when it wasn't working. — Leah Sapan, Apr 25 '16 at 14:46
Same, don't remove anything there. Basically, you need to do `rm -f pg_xlog/*`. Anyway, I'd check the config file once more, following the manual step by step, check for network connectivity issues, and then redo the setup from scratch. — sysfiend, Apr 25 '16 at 14:49
@Alex there's no reason the primary would need `restore_command` set somewhere, correct? In case it could somehow hit the archive and stream it over to the secondary. — Leah Sapan, Apr 25 '16 at 16:48
Nope, `restore_command` is only needed on the "slave"; on the primary server you only need to edit `postgresql.conf`. — sysfiend, Apr 26 '16 at 07:44
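The "check the config file once again" advice in the comments can be partly automated. The sketch below assumes a PostgreSQL release that still reads recovery.conf from the data directory (the `pg_xlog` references suggest a 9.x server). The function name `check_recovery_conf` is hypothetical. It only confirms that the file is where the server looks for it and that the two settings discussed above are present; it does not validate their values.

```shell
#!/bin/sh
# Hypothetical helper: check_recovery_conf /path/to/data_directory

check_recovery_conf() {
    datadir=$1
    conf="$datadir/recovery.conf"

    # A recovery.conf placed anywhere other than the data directory is not read.
    if [ ! -f "$conf" ]; then
        echo "missing: $conf"
        return 1
    fi

    # Show the active settings, skipping blank lines and comments.
    grep -Ev '^[[:space:]]*(#|$)' "$conf"

    # Both settings must appear uncommented for standby archive recovery.
    for key in standby_mode restore_command; do
        if ! grep -Eq "^[[:space:]]*$key[[:space:]]*=" "$conf"; then
            echo "not set: $key"
            return 1
        fi
    done
    return 0
}
```

Running it as the postgres user against the standby's data directory also confirms that the account the server runs under can read the file.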
