
This is a sub-part of a cluster I am building. I am executing pcp_recovery_node on the master to build the standby from scratch with the command

pcp_recovery_node -h 193.185.83.119 -p 9898 -U postgres -n 1

Here, 193.185.83.119 is the VIP (virtual IP). The command successfully builds and starts the standby on node-b (the nodes are node-a and node-b), but it never returns and simply hangs in the shell:

[postgres@rollc-filesrvr1 data]$ pcp_recovery_node -h 193.185.83.119 -p 9898 -U postgres -n 1
Password:

I have to press Ctrl+C to get out of this session. Later, when I try to create a test database on node-a (master), I get the following error:

      postgres=# create database test;
        ERROR:  source database "template1" is being accessed by other users
        DETAIL:  There is 1 other session using the database.
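
As far as I understand, this error just means that some other session still has a connection open to template1 while CREATE DATABASE wants to copy it. A plain pg_stat_activity query (generic PostgreSQL, nothing pgpool-specific) should show which session is holding it:

    postgres=# SELECT pid, usename, application_name, client_addr, state
               FROM pg_stat_activity
               WHERE datname = 'template1';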

I have confirmed that pgpool.service is running on node-a at the time I run this command, and I have tried both starting and stopping pgpool.service on node-b (standby) before issuing the pcp command. The result remains the same.

I also tried searching around and tweaked the following setting in pgpool.conf. I am not sure whether these parameters are related:

wd_lifecheck_dbname in pgpool.conf

Initially the related settings were as follows (and I was still getting the same result):

wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'nobody'
wd_lifecheck_password = ''

Later on, I found different settings here, here and one suggestion here, and tried different combinations like the following:

wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'postgres'
wd_lifecheck_password = ''

or

wd_lifecheck_dbname = 'postgres'
wd_lifecheck_user = 'postgres'
wd_lifecheck_password = ''

But none of these changed the situation: the command still hangs in the shell, and I still cannot create the test database on the master. I feel I have reached a dead end.

I still do not fully understand the purpose and meaning of the above three parameters in pgpool, and I suspect these are the ones I have not configured correctly, although there could be other reasons as well.
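
From what I gather from the pgpool-II documentation, these three parameters are only used when the watchdog life-check runs in query mode; with the default heartbeat method they should not matter. A sketch of how the related pgpool.conf block fits together (the values are illustrative only, not a recommendation):

    # only relevant when the watchdog life-check uses query mode
    wd_lifecheck_method   = 'query'      # 'heartbeat' (default), 'query' or 'external'
    wd_lifecheck_query    = 'SELECT 1'   # query the life-check issues
    wd_lifecheck_dbname   = 'postgres'   # database the life-check connects to
    wd_lifecheck_user     = 'postgres'   # user for that connection
    wd_lifecheck_password = ''           # its password (empty if authentication allows)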

Just to help, here are the environment details again:

  • node-a and node-b environment: RHEL 7.6
  • PostgreSQL version: 10.7
  • Pgpool-II version: 4.0.3
  • replication slot + WAL archive

Here are the logs from pgpool.service on node-a:

Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 16642: LOG:  forked new pcp worker, pid=8534 socket=7
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: LOG:  starting recovering node 1
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: LOG:  executing recovery
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: DETAIL:  starting recovery command: "SELECT pgpool_recovery('recovery_1st_stage', 'node-a-ip', '/data/test/data', '5438', 1)"
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: LOG:  executing recovery
Mar 18 21:10:17 node-a pgpool[16583]: 2019-03-18 21:10:17: pid 8534: DETAIL:  disabling statement_timeout
Mar 18 21:10:18 node-a pgpool[16583]: 2019-03-18 21:10:18: pid 8534: LOG:  node recovery, 1st stage is done
Mar 18 21:11:37 node-a pgpool[16583]: 2019-03-18 21:11:37: pid 8534: LOG:  checking if postmaster is started
Mar 18 21:11:37 node-a pgpool[16583]: 2019-03-18 21:11:37: pid 8534: DETAIL:  trying to connect to postmaster on hostname:node-b-ip database:postgres user:postgres (retry 0 times)
...
...2 more times 
Mar 18 21:11:49 node-a pgpool[16583]: 2019-03-18 21:11:49: pid 8534: LOG:  checking if postmaster is started
Mar 18 21:11:49 node-a pgpool[16583]: 2019-03-18 21:11:49: pid 8534: DETAIL:  trying to connect to postmaster on hostname:node-a-ip database:template1 user:postgres (retry 0 times)
...it keeps retrying until I press Ctrl+C in the pcp command window. I have seen the retry count go up to 30 or more.

Also, node-b is never shown as up when checking with pgpool:

postgres=> show pool_nodes;
 node_id | hostname  | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | node-a-ip | 5438 | up     | 0.500000  | primary | 0          | true              | 0                 | 2019-03-18 22:59:19
 1       | node-b-ip | 5438 | down   | 0.500000  | standby | 0          | false             | 0                 | 2019-03-18 22:59:19
(2 rows)

EDIT: I am now at least able to correct the last part of this issue, i.e. adding the standby node to the cluster:

[postgres@node-a-hostname]$ pcp_attach_node -n 1
Password:
pcp_attach_node -- Command Successful

and now show pool_nodes at least shows the correct situation:

postgres=> show pool_nodes;
 node_id | hostname  | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
---------+-----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | node-a-ip | 5438 | up     | 0.500000  | primary | 0          | false             | 0                 | 2019-03-18 22:59:19
 1       | node-b-ip | 5438 | up     | 0.500000  | standby | 0          | true              | 0                 | 2019-03-19 11:38:38
(2 rows)

But the underlying problem of not being able to create a database on node-a (master) is still there.

EDIT 2: I tried INSERT and UPDATE on the master and they are correctly replicated to node-b, but CREATE DATABASE is still not working.


1 Answer

First, a correction to EDIT 1: pcp_attach_node did indeed correct the output of show pool_nodes, but it further complicated the issue, because other commands such as

pcp_watchdog_info -h 193.185.83.119 -p 9898 -U postgres

started getting stuck. Later on, I found out that

pcp_attach_node -n 1

was not needed at all, neither to attach the standby nor to correct the output of show pool_nodes on the master, IF the original pcp_recovery_node completes correctly.

The root cause of the original problem (and of the watchdog getting stuck later on) was that the pgpool_remote_start script was not completing correctly even after starting the standby. I could see it in

ps -ef | grep pgpool

on the master.

I contacted the pgpool bug tracking system here and they helped me fix it further. It was an incorrect PostgreSQL start command in pgpool_remote_start that was causing the trouble, so pcp_recovery_node never completed, and nothing else did later on either.

The correct command in pgpool_remote_start should be something like this (and this is what I used):

ssh -T postgres@$REMOTE_HOST /usr/pgsql-10/bin/pg_ctl -w start -D /data/test/data 2>/dev/null 1>/dev/null </dev/null &

while I was using

ssh -T postgres@$REMOTE_HOST /usr/pgsql-10/bin/pg_ctl start -D /data/test/data

I was missing the -w flag. There was also no redirection of stdout and stderr to /dev/null, and stdin was not closed (</dev/null), so ssh never received EOF.
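
For completeness, here is a minimal sketch of the whole pgpool_remote_start along those lines. As far as I understand, pgpool passes the destination host and the destination data directory as the two positional arguments; the pg_ctl path is simply from my RHEL 7 / PostgreSQL 10 install and may differ elsewhere:

    #!/bin/bash
    # pgpool_remote_start -- run by pgpool on the primary after recovery_1st_stage
    REMOTE_HOST=$1      # destination (standby) host, passed in by pgpool
    REMOTE_PGDATA=$2    # destination data directory, passed in by pgpool
    PGCTL=/usr/pgsql-10/bin/pg_ctl

    # -w waits until the postmaster has actually started; closing stdin and
    # discarding stdout/stderr lets the ssh call (and hence this script) return.
    ssh -T postgres@$REMOTE_HOST $PGCTL -w start -D $REMOTE_PGDATA 2>/dev/null 1>/dev/null </dev/null &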

One thing that is still unclear to me, but may help someone facing a similar problem: first start pgpool.service on the standby, or keep it running, before issuing the pcp command on the master.
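
On my nodes that simply meant running (assuming the systemd unit is named pgpool.service, as it is here):

    [root@node-b ~]# systemctl start pgpool.service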

  • One more insight: I did not change any of the wd_lifecheck_* settings in this process; they are still as per my last attempt (wd_lifecheck_dbname = 'postgres', wd_lifecheck_user = 'postgres', wd_lifecheck_password = 'postgres'), and only changing the pgpool_remote_start command fixed my problem. – user3792812 Mar 21 '19 at 09:38