
I want to achieve automatic VM migration when one node dies. I created the Proxmox cluster, set up replication, and configured the IPMI watchdog, but when one node is lost nothing happens. I followed https://pve.proxmox.com/pve-docs/chapter-ha-manager.html and https://pve.proxmox.com/wiki/High_Availability_Cluster_4.x#Hardware_Watchdogs

ha-manager config
ct:100
        group HA
        max_restart 0
        state started

ha-manager status
quorum OK
master node1 (active, Mon May 18 09:18:59 2020)
lrm node1 (idle, Mon May 18 09:19:00 2020)
lrm node2 (active, Mon May 18 09:19:02 2020)
service ct:100 (node2, started)

When I shut down node2, I see this in the log:

May 18 08:12:37 node1 pve-ha-crm[2222]: lost lock 'ha_manager_lock - cfs lock update failed - Operation not permitted
May 18 08:12:38 node1 pmxcfs[2008]: [dcdb] notice: start cluster connection
May 18 08:12:38 node1 pmxcfs[2008]: [dcdb] crit: cpg_join failed: 14
May 18 08:12:38 node1 pmxcfs[2008]: [dcdb] crit: can't initialize service
May 18 08:12:42 node1 pve-ha-crm[2222]: status change master => lost_manager_lock
May 18 08:12:42 node1 pve-ha-crm[2222]: watchdog closed (disabled)
May 18 08:12:42 node1 pve-ha-crm[2222]: status change lost_manager_lock => wait_for_quorum
May 18 08:12:44 node1 pmxcfs[2008]: [dcdb] notice: members: 1/2008
May 18 08:12:44 node1 pmxcfs[2008]: [dcdb] notice: all data is up to date
May 18 08:13:00 node1 systemd[1]: Starting Proxmox VE replication runner...
May 18 08:13:01 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:02 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:03 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:04 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:05 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:05 node1 pveproxy[39495]: proxy detected vanished client connection
May 18 08:13:06 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:07 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:08 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:09 node1 pvesr[40781]: trying to acquire cfs lock 'file-replication_cfg' ...
May 18 08:13:10 node1 pvesr[40781]: error with cfs lock 'file-replication_cfg': no quorum!

2 Answers


The problem is with quorum, and it is non-trivial and does not work intuitively. When you set up a Proxmox cluster, it enables a quorum mechanism: to perform any operation, the cluster needs votes from the nodes confirming that they agree on the current state, and it requires a majority, i.e. more than 50% of the existing nodes (50% + 1), to accept a vote.
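To make the arithmetic concrete (standard majority quorum, one vote per node): with 3 nodes you need floor(3/2) + 1 = 2 votes, so one node can fail and the cluster stays quorate; with 2 nodes you also need floor(2/2) + 1 = 2 votes, so a single failure blocks everything.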

There is an idiotic default setting when you create a 2-node cluster: it needs 50% + 1 = 2 nodes to do anything. So despite being a 'cluster', if one node dies you cannot even power on a VM or container until both nodes are working again.

There is a workaround: in corosync.conf (/etc/corosync/corosync.conf) you have to set two parameters: two_node: 1 and wait_for_all: 0

The first parameter says that in a two-node cluster a single vote is enough to perform operations. But there is yet another trap for young players: two_node automatically enables wait_for_all, which prevents the cluster from operating after power-on until all nodes appear, and that practically ruins the cluster again. So you have to override that too by setting wait_for_all: 0 explicitly.
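As a minimal sketch, the quorum section of corosync.conf can look roughly like this with both settings applied (the provider line is the corosync default on Proxmox, the rest of the file stays untouched):

    quorum {
      provider: corosync_votequorum
      two_node: 1
      wait_for_all: 0
    }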

Read this man page carefully: https://www.systutorials.com/docs/linux/man/5-votequorum/

But there is YET ANOTHER catch. There are 2 versions of corosync.conf: /etc/corosync/corosync.conf and /etc/pve/corosync.conf

and whenever the second one is changed, the first one is overwritten. So you have to edit the latter one (/etc/pve/corosync.conf). But when your second node is down, the cluster has lost quorum, so you first have to lower the quorum requirement for a moment and then edit the file.
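For example (assuming a standard Proxmox install, where /etc/pve goes read-only without quorum), something like:

    pvecm expected 1              # temporarily tell corosync to expect a single vote, so this node becomes quorate again
    nano /etc/pve/corosync.conf   # add two_node / wait_for_all and save

Remember to increment config_version in the file, as the Proxmox docs require, so the change gets synced back to /etc/corosync/corosync.conf on the nodes.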

Damago

You need 3 nodes to have HA working. The 3rd node can be replaced with a QDevice to provide the needed vote. See https://pve.proxmox.com/wiki/Cluster_Manager.
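A rough sketch of the QDevice setup, assuming a Debian-based external host at 192.168.1.10 (the IP is just an example) and the package names from the Proxmox Cluster Manager docs:

    apt install corosync-qnetd        # on the external host that will provide the extra vote
    apt install corosync-qdevice      # on every Proxmox cluster node
    pvecm qdevice setup 192.168.1.10  # run on one cluster node to register the QDevice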