0

We have master-slave replication on

PostgreSQL 9.4.9
CentOS 6.8

and till today we've had beautiful time with our replication between our two +- identical servers. But today I ran VACUUM FULL (on the master of course) which destroyed replication (as expected), but that was not supposed to be a big deal as we have "turned" the replication off and back on for so many times before. But this time it was different.

After executing our many-times-proved script (basically pg_start_backup(), full rsync of data/ directory (with some conf exludes) and pg_stop_backup()), the synchronization looked ok, but the slave DB has become no longer (RO-) accessible by psql. The error reads:

psql: FATAL: could not open file "global/12745": No such file or directory

After a couple of re-runs I gave up and created empty global/12745 to see what's going to happen, but instead I am always getting

psql: FATAL: role "postgres" does not exist

Actually it seems, that no role we have on the master exists for the slave DB, and this is still true even after disabling replication. So now, I have no idea how even to access the slave database. At the same time, the master DB has no such issue, and "postgres" (or any other user we have) is functioning there perfectly.

I did many attempts, including complete removal of /var/lib/pgsql/9.4 directory and reinstall of rpms with initdb. (Fresh empty DB works fine on the slave.)

What could have gone wrong? Have my primary DB became somehow "non-replicable" anymore? That'd be pitty, as this is our primary mean of backup.

Any help is greatly appreciated. Thanks a lot.

  • Well I've noticed, that fresh empty initdb-ed DB has `data/global` directory with files named like 12??? (including wanted `global/12745`), but my master's DB has only set of `global/6????`'s, created during `VACUUM FULL`, which are transported over via rsync to the slave. And while master is completely fine with that, slave is complaining, and it wants to have its files back. What am I missing? Or what slave is missing, how to tell it, that the files have changed and it needs to adapt? – mkwinco Oct 05 '16 at 19:06
  • We've noticed one more thing: when starting up the slave DB (e.g. after rsync), file `PG_DATA/global/pg_internal.init` is deleted. Even when I try to copy it manually from master, it is always deleted after DB restart. But on the master it lives happily, the restart of the master DB just changes its timestamp, size stays the same (no sure about content). But nevertheless we still have no idea about root cause and how to fix it. – mkwinco Oct 06 '16 at 14:54
  • Well, after two days playing around nothing worked. Then I left it for a day as it was and went preparing a brand new server for dump-restore. My collegue wanted to take a look for one more thing, he started the replication (the very same way as I did at least 10 times before him), and all of the sudden it worked. So now it's working just fine, although we have no faintest idea what went wrong and why it is now ok. However we'd like to avoid such nasty surprises next time, so any insight what could that have been is also appreciated. – mkwinco Oct 08 '16 at 08:08
  • Were you able to resolve this? Were you replicating(The `pg_start_backup()` and the following) with replica server stopped or was it up? – Deepak Puthraya Nov 03 '16 at 04:00

0 Answers0