
Is there any way to stop replication without logging into the psql shell? A disk-full situation led to some corruption in the PostgreSQL files, and the server keeps restarting.

2023-02-06 08:17:54 UTC [1] LOG:  starting PostgreSQL 13.7 (Ubuntu 13.7-1.pgdg20.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
2023-02-06 08:17:54 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-02-06 08:17:54 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2023-02-06 08:17:54 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-02-06 08:17:54 UTC [8] LOG:  database system was shut down at 2023-02-06 08:17:45 UTC
2023-02-06 08:17:54 UTC [8] PANIC:  could not open file "pg_replslot/slot_name/state": No such file or directory
2023-02-06 08:17:55 UTC [1] LOG:  startup process (PID 8) was terminated by signal 6: Aborted
2023-02-06 08:17:55 UTC [1] LOG:  aborting startup due to startup process failure
2023-02-06 08:17:55 UTC [1] LOG:  database system is shut down
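
The PANIC above is raised because a slot directory exists under pg_replslot but its state file cannot be opened. As a rough diagnostic, a minimal sketch like the following could list every slot directory in that condition before anything is deleted. The function name `slots_missing_state` and the default data directory path are assumptions for illustration, not something taken from this setup.

```python
from pathlib import Path
import sys


def slots_missing_state(data_dir):
    """Return names of directories under pg_replslot that have no 'state' file."""
    slot_root = Path(data_dir) / "pg_replslot"
    if not slot_root.is_dir():
        return []
    return sorted(
        entry.name
        for entry in slot_root.iterdir()
        if entry.is_dir() and not (entry / "state").is_file()
    )


if __name__ == "__main__":
    # Hypothetical default path; pass the real data directory as the first argument.
    data_dir = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/postgresql/data"
    for name in slots_missing_state(data_dir):
        print(f"slot directory without state file: {name}")
```

This only reads the directory tree, so it should be safe to run against a stopped instance; it says nothing about whether the rest of the data directory is intact.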

I tried removing pg_replslot/slot_name, which led to a "password auth failure". After resetting the DB password (via pg_hba.conf), our database is no longer showing up!

Is there any proper way to recover from this state? The /pg/main files and pgdata directories seem to be available, except for this slot information.

Details and steps taken so far:

  • I'm running PostgreSQL in a Docker container.
  • The disk used for PostgreSQL got full. I cleaned up some log files and ran docker system prune to remove unused images, which freed some space but led to this issue.
  • We have seen similar issues multiple times in dev environments: a full disk leads to corrupted files and errors such as "unable to read" or "No such file or directory".
  • I tried removing the pg_replslot/slot_name directory, which allowed the container to start (previously it kept restarting).
  • I reset the password by setting the auth method to trust in pg_hba.conf.
  • Now \l in the psql shell shows only the postgres database and the other default databases, not our custom database.
  • Our main database is in a separate tablespace and does not show up in the list.
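
Since the main database lives in a separate tablespace, one thing that can be checked from the file system is whether the tablespace links still resolve. PostgreSQL keeps one symbolic link per user-defined tablespace in the pg_tblspc directory of the data directory, named after the tablespace OID. The sketch below (the function name `tablespace_links` is an assumption for illustration) lists those links and reports whether each target directory exists. In a Docker setup the targets are paths as seen inside the container, so it would need to run there or with the same mounts.

```python
import os
from pathlib import Path


def tablespace_links(data_dir):
    """Map each symlink in pg_tblspc to (target path, True if the target is a directory)."""
    tblspc = Path(data_dir) / "pg_tblspc"
    result = {}
    if not tblspc.is_dir():
        return result
    for entry in sorted(tblspc.iterdir()):
        if entry.is_symlink():
            # is_dir() follows the link, so it is False for a dangling symlink.
            result[entry.name] = (os.readlink(entry), entry.is_dir())
    return result
```

A link whose target is missing would suggest the tablespace volume was not mounted or was removed; an intact link does not by itself mean the database can be recovered.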

Most importantly, the standby is also showing the SAME errors! Probably someone messed with it?

Anto
  • I don't think that a simple disk-full condition leads to this. You have to give us more information: 1) is this the primary or the standby server? 2) What *exactly* did you do after the disk was full and the database crashed? Be as detailed as possible. – Laurenz Albe Feb 06 '23 at 09:05
  • @LaurenzAlbe Added more details. docker system prune to remove unused images and this is on primary. But, I think it's easy to reproduce the similar corruption situation by completely utilizing the disk – Anto Feb 06 '23 at 09:51
  • Thanks. You write "cleaned up some log files". Can you give me details as to which log files in which directory? – Laurenz Albe Feb 06 '23 at 10:08
  • @LaurenzAlbe it's some custom app logs. non PG files. – Anto Feb 06 '23 at 10:36
  • Thanks. I don't know about docker. Looking at the manual page I see that `docker system prune` should not modify any volumes, but seemingly just that happened, as evidenced by the missing `pg_replslot/slot_name/state`. At this point, you should restore your backup. For the future: the correct way to deal with "out of space" conditions is to increase the space. – Laurenz Albe Feb 06 '23 at 11:27
  • @LaurenzAlbe Can we recover any data based on tablespace files? Current error was from /pg/xx replication slot named file. – Anto Feb 06 '23 at 11:27
  • Only an expert could recover the data now. This is beyond a Stack Overflow answer. – Laurenz Albe Feb 06 '23 at 11:28

0 Answers