Postgres replication falling behind - master -> slave, slave, slave

Question

I currently have 1 master server (a) and 3 replication servers (b, c, d) that come directly from the master, the archive_command I use is the following script: https://gist.github.com/Geesu/8640616

All servers are Ubuntu 12.04.4 running PostgreSQL 9.2.6

And here is the recovery.conf for each of the replication servers: https://gist.github.com/Geesu/8640635

What's strange is about 6 hours after I started the replication servers, 2 of them immediately fell behind and are now stuck trying to catch up, but they keep getting further behind. Here is how far behind they are compared to master:

a 20287.825072
b 2.521136
c 19994.51653

Does anyone have any ideas as to why one of the servers is nearly caught up completely, but the others keep falling behind? I have verified that a and c are processing the WAL segments, it's just not able to do it fast enough.

Some log examples from a and c:

cp: cannot stat `/var/lib/postgresql/9.2/archive/000000080000109E0000009A': No such file or directory
2014-01-26 23:02:14 GMT LOG:  record with zero length at 109E/9AE622D8
cp: cannot stat `/var/lib/postgresql/9.2/archive/000000080000109E0000009A': No such file or directory
2014-01-26 23:02:14 GMT LOG:  streaming replication successfully connected to primary
2014-01-26 23:03:36 GMT FATAL:  could not receive data from WAL stream: SSL error: sslv3 alert unexpected message

cp: cannot stat `/var/lib/postgresql/9.2/archive/000000080000109E000000B9': No such file or directory
2014-01-26 23:03:41 GMT LOG:  record with zero length at 109E/B9E797E0
cp: cannot stat `/var/lib/postgresql/9.2/archive/000000080000109E000000B9': No such file or directory
2014-01-26 23:03:41 GMT LOG:  streaming replication successfully connected to primary

Maybe this is related? Eventually it will get the appropriate WAL segment and be processed.

Any suggestions on how I can further debug this?

Craig Ringer · Accepted Answer · 2014-01-27T09:51:39.307

There are a few things wrong here.

WAL script

Tries to push to each server until all succeed

First, your master server is - or is attempting to - push WAL archives to each replica, and only treat the archive_command as successful if all three receive the file. That means that if one goes down, the other two will stop receiving new WAL, because Pg on the master will stay stuck retrying the same segment over and over.

Doesn't notice when it fails

Your script fails to actually return $FAIL, so in practice this doesn't work - it reports success even if all three servers failed to receive the WAL files. It will keep on plodding to the WAL file, ignoring the fact that a prior failure means the whole sequence is useless and cannot be replayed. That might well explain why the replicas cannot find the local WAL segments; they probably failed to copy due to env-var issues (RSYNC_RSH), known-hosts problems, ssh key problems, etc.

How to fix the wal sender script

I strongly recommend that you switch to a push/pull model. Have the master server push the WAL to a reliable storage location. Then have the replicas pull from that location. The master only has to copy the file once, and doesn't have to worry about retrying at different points for different servers if some are down, others are catching up, etc. An increasingly popular choice for this is Amazon S3, though Windows Azure Block Service (WABS) and OpenStack's Swift are also useful. A useful tool for this is WAL-E, which can serve as your archive_command and restore_command; it supports all of the listed stores.

Handily, if you use S3, you can have an archival policy that stashes old stuff in Glacier instead of removing it, then deletes it much later. This is a cheap and convenient way to store things like this "just in case" - but remember, it's useless unless you store the associated base backup too!

If you must have the master push to all three backends, you'll need to be much smarter about it, with a script that keeps separate track of which servers have received each WAL segment, and reports success when one server has received a segment and all previous segments. You'll need a background job that keeps retrying for servers that are behind. Even then, this is a bad idea; it'll make your WAL archiving as slow as the slowest server, which really isn't ideal and can leave the master filling up very badly. I'm sure this can be made to work, but IMO it is too fragile and complicated.

Streaming replication is broken - the SSLv3 renegotiation bug?

In your logs:

2014-01-26 23:03:36 GMT FATAL: could not receive data from WAL stream: SSL error: sslv3 alert unexpected message

Your servers are set up for streaming replication, but they're having SSL issues. This is the signature error of the sslv3 renegotiation problem, and the early brain-dead OpenSSL "fix" for it. Make sure you've updated to the latest OpenSSL patch-release, as that should fix the problem.

If it doesn't, as a work-around, you could try ssl_renegotiation_limit=0 in postgresql.conf. See this launchpad bug.

Thank you for this, b actually went down today, so this makes so much more sense. I'm going to switch to WAL-E, that is EXACTLY what we need! — Geesu, Jan 27 '14 at 15:24