0

Our logical replication has started to fail with a lot of these errors on the subscriber:

ERROR:  could not create replication slot 
"pg_148211637_sync_148178452_7161337762731575223": ERROR:  all replication slots are in use
May  2 07:43:01 psql-06 postgres[1195903]: [2]  HINT:  Free one or increase max_replication_slots.
May  2 07:43:01 psql-06 postgres[1181479]: [1] LOG:  background worker "logical replication worker" (PID 1195903) exited with exit code 1
May  2 07:43:01 psql-06 postgres[1195905]: [1] LOG:  logical replication table synchronization worker for subscription "sub_production", table "products" has started
May  2 07:43:01 psql-06 postgres[1195901]: [1] LOG:  could not drop replication slot "pg_148211637_sync_148183118_7161337762731575223" on publisher: ERROR:  replication slot "pg_148211637_sync_148183118_7161337762731575223" does not exist

It seems like it's constantly trying drop the replication slot pg_148211637_sync_148183118_7161337762731575223 on the publisher, but somehow it does not exists?

How can such a situation occur, and how do I resolve it/resume the replication?

Niels Kristian
  • 8,661
  • 11
  • 59
  • 117

2 Answers2

1

Connect to the primary and look for abandoned replication slots there:

SELECT slot_name, slot_type, restart_lsn
FROM pg_replication_slots
WHERE NOT active;

Any of these that don't belong to a standby server that's still needed should go:

SELECT pg_drop_replication_slot('name of the slot');

If you need more than 10 replication slots, increase max_replication_slots on the primary and restart it.

It is hard to say how you got into trouble, but probably like this: whenever you say CREATE SUBSCRIPTION, the standby connects to the primary and creates a logical replication slot (unless you explicitly tell it not to). If then something goes wrong with creating the subscription, of if replication breaks, the replication slot is still there on the primary. So perhaps it took many failed attempts until you got logical replication running, and you neglecting cleaning up after failed attempts.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
  • Thanks, we seem to have 30 slots which are inactive - now - how does this come to happen? The only thing we have done was to start the logical replication with the following settings: On the publisher `max_replication_slots: 35`, `max_sync_workers_per_subscription: 2`, `max_logical_replication_workers: 16` and on the 2 subscriber servers: `max_sync_workers_per_subscription: 2`, `max_logical_replication_workers: 16` - is something wrong here? – Niels Kristian May 02 '23 at 08:25
  • I have extended the answer in an attempt to answer that. – Laurenz Albe May 02 '23 at 09:13
0

You are misreading the error. The primary error is that it cannot create the slot, because all allowed slots are already in use.

But the replica then tries to clean up after itself, which it does by attempting the drop the slot which it failed to create in the first place, which of course doesn't work, because there is nothing to clean up. So this part fails, but it is not the interesting part of the overall failure.

jjanes
  • 37,812
  • 5
  • 27
  • 34