[ClusterLabs] temporary loss of quorum when member starts to rejoin
Sherrard Burton
sb-clusterlabs at allafrica.com
Mon Apr 6 10:05:15 EDT 2020
...or at least that's what i think is happening :-)
two-node cluster, plus a quorum-only node. i am testing the behavior when
the active node is gracefully rebooted. all seems well initially: resources
are migrated, come up, and function as expected.
but, when the rebooted node starts to come back up, the other node seems
to lose quorum temporarily, even though it still has communication with
the quorum node. this causes the resources to stop until quorum is
reestablished.
summary:
active node: xen-nfs01 192.168.250.50
standby node: xen-nfs02 192.168.250.51
quorum node: xen-quorum 192.168.250.52
issue reboot on xen-nfs01
xen-nfs02 becomes active node
xen-nfs01 starts to come back online
xen-nfs02 detects loss of quorum and stops resources
xen-nfs01 finishes booting
quorum is reestablished
instead of inundating you with all of the debugging output from
corosync, pacemaker and corosync-qnetd on all nodes, i'll start with the
basics, and provide whatever else is needed on request.
TIA
from the node that was not rebooted:
Apr 5 23:10:15 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
error from 192.168.250.51: No route to host
Apr 5 23:10:15 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
error from 192.168.250.51: No route to host
Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
error from 192.168.250.50: Connection refused
Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
error from 192.168.250.50: Connection refused
Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] rx: host: 1 link:
0 received pong: 1
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Received vote info
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: seq = 6
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: vote = NACK
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: ring id = (2.814)
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm result vote
is NACK
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Cast vote timer
remains scheduled every 500ms voting NACK.
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] flags:
quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes
QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] got nodeinfo
message from cluster node 2
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] nodeinfo
message[2]: votes: 1, expected: 3 flags: 49
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] flags:
quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes
QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ]
total_votes=2, expected_votes=3
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] node 1
state=2, votes=1, expected=3
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] node 2
state=1, votes=1, expected=3
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] quorum lost,
blocking activity
Apr 05 23:10:17 [19099] xen-nfs02 corosync notice [QUORUM] This node is
within the non-primary component and will NOT provide any services.
Apr 05 23:10:17 [19099] xen-nfs02 corosync notice [QUORUM] Members[1]: 2
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [QUORUM] sending
quorum notification to (nil), length = 52
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] Sending
quorum callback, quorate = 0
...
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Votequorum quorum
notify callback:
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Quorate = 0
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Node list (size = 3):
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 0 nodeid = 1,
state = 2
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 1 nodeid = 2,
state = 1
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 2 nodeid = 0,
state = 0
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm decided to
send list and result vote is No change
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Sending quorum node
list seq = 13, quorate = 0
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Node list:
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 0 node_id = 1,
data_center_id = 0, node_state = dead
Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 1 node_id = 2,
data_center_id = 0, node_state = member
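
for anyone who wants to pull the votequorum state out of a larger log, here is a minimal python sketch that turns one of the "[VOTEQ ] flags:" debug lines above into a dict. the sample line is copied from the log with the mail wrapping undone; the function name and the regex are my own illustration, not anything shipped with corosync, and the flag layout is assumed to match what this corosync build prints.

```python
import re

# One VOTEQ debug line copied from the log above, with mail line-wrapping undone.
FLAGS_LINE = (
    "Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] flags: "
    "quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes "
    "QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No"
)

# A flag name may contain spaces ("WFA Status"); values are Yes or No.
FLAG_RE = re.compile(r"(\w[\w ]*?): (Yes|No)\b")


def parse_voteq_flags(line):
    """Return {flag name: bool} for a '[VOTEQ ] flags:' debug line.

    Hypothetical helper for illustration; returns {} if the line
    does not contain a 'flags: ' section.
    """
    _, _, tail = line.partition("flags: ")
    return {name: value == "Yes" for name, value in FLAG_RE.findall(tail)}


flags = parse_voteq_flags(FLAGS_LINE)
print(flags["quorate"], flags["QdeviceAlive"], flags["QdeviceCastVote"])
```

on the sample line this reports the qdevice as alive but not casting its vote, which lines up with the NACK received from the quorum node just before "quorum lost, blocking activity" is logged.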
from the quorum node:
Apr 05 23:10:17 debug New client connected
Apr 05 23:10:17 debug cluster name = xen-nfs01_xen-nfs02
Apr 05 23:10:17 debug tls started = 1
Apr 05 23:10:17 debug tls peer certificate verified = 1
Apr 05 23:10:17 debug node_id = 1
Apr 05 23:10:17 debug pointer = 0x55b37c2d74f0
Apr 05 23:10:17 debug addr_str = ::ffff:192.168.250.50:54462
Apr 05 23:10:17 debug ring id = (1.814)
Apr 05 23:10:17 debug cluster dump:
Apr 05 23:10:17 debug client = ::ffff:192.168.250.51:54876,
node_id = 2
Apr 05 23:10:17 debug client = ::ffff:192.168.250.50:54462,
node_id = 1
Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
xen-nfs01_xen-nfs02, node_id 1) sent initial node list.
Apr 05 23:10:17 debug msg seq num = 4
Apr 05 23:10:17 debug node list:
Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
= not set
Apr 05 23:10:17 debug node_id = 2, data_center_id = 0, node_state
= not set
Apr 05 23:10:17 debug Algorithm result vote is Ask later
Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
xen-nfs01_xen-nfs02, node_id 1) sent membership node list.
Apr 05 23:10:17 debug msg seq num = 5
Apr 05 23:10:17 debug ring id = (1.814)
Apr 05 23:10:17 debug heuristics = Undefined
Apr 05 23:10:17 debug node list:
Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
= not set
Apr 05 23:10:17 debug ffsplit: Membership for cluster
xen-nfs01_xen-nfs02 is now stable
Apr 05 23:10:17 debug ffsplit: Quorate partition selected
Apr 05 23:10:17 debug node list:
Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
= not set
Apr 05 23:10:17 debug Sending vote info to client
::ffff:192.168.250.51:54876 (cluster xen-nfs01_xen-nfs02, node_id 2)
Apr 05 23:10:17 debug msg seq num = 6
Apr 05 23:10:17 debug vote = NACK
Apr 05 23:10:17 debug Algorithm result vote is No change
Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
xen-nfs01_xen-nfs02, node_id 1) sent quorum node list.
Apr 05 23:10:17 debug msg seq num = 6
Apr 05 23:10:17 debug quorate = 0
Apr 05 23:10:17 debug node list:
Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
= member
Apr 05 23:10:17 debug Algorithm result vote is No change
Apr 05 23:10:17 debug Client ::ffff:192.168.250.51:54876 (cluster
xen-nfs01_xen-nfs02, node_id 2) replied back to vote info message
Apr 05 23:10:17 debug msg seq num = 6
Apr 05 23:10:17 debug ffsplit: All NACK votes sent for cluster
xen-nfs01_xen-nfs02
Apr 05 23:10:17 debug Sending vote info to client
::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1)
Apr 05 23:10:17 debug msg seq num = 1
Apr 05 23:10:17 debug vote = ACK
Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
xen-nfs01_xen-nfs02, node_id 1) replied back to vote info message
Apr 05 23:10:17 debug msg seq num = 1
Apr 05 23:10:17 debug ffsplit: All ACK votes sent for cluster
xen-nfs01_xen-nfs02
Apr 05 23:10:17 debug Client ::ffff:192.168.250.51:54876 (cluster
xen-nfs01_xen-nfs02, node_id 2) sent quorum node list.
Apr 05 23:10:17 debug msg seq num = 13
Apr 05 23:10:17 debug quorate = 0
Apr 05 23:10:17 debug node list:
Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
= dead
Apr 05 23:10:17 debug node_id = 2, data_center_id = 0, node_state
= member
Apr 05 23:10:17 debug Algorithm result vote is No change
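
to make the order of the quorum node's decisions easier to see, here is a small python sketch that lists which vote corosync-qnetd sent to which node. the sample text is an abridged copy of the log above (mail wrapping undone, lines in between dropped); the function name and regexes are hypothetical and only assume the message wording shown in this log.

```python
import re

# Abridged excerpt of the corosync-qnetd debug log above, mail wrapping undone.
QNETD_LOG = """\
Apr 05 23:10:17 debug Sending vote info to client ::ffff:192.168.250.51:54876 (cluster xen-nfs01_xen-nfs02, node_id 2)
Apr 05 23:10:17 debug msg seq num = 6
Apr 05 23:10:17 debug vote = NACK
Apr 05 23:10:17 debug Algorithm result vote is No change
Apr 05 23:10:17 debug Sending vote info to client ::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1)
Apr 05 23:10:17 debug msg seq num = 1
Apr 05 23:10:17 debug vote = ACK
"""

SEND_RE = re.compile(
    r"Sending vote info to client \S+ \(cluster \S+ node_id (\d+)\)"
)
VOTE_RE = re.compile(r"\bvote\s*=\s*(\w+)")


def votes_sent(log_text):
    """Return [(node_id, vote)] in the order the votes appear in the log."""
    result = []
    pending = None  # node_id from the most recent 'Sending vote info' line
    for line in log_text.splitlines():
        match = SEND_RE.search(line)
        if match:
            pending = int(match.group(1))
            continue
        match = VOTE_RE.search(line)
        if match and pending is not None:
            result.append((pending, match.group(1)))
            pending = None
    return result


print(votes_sent(QNETD_LOG))
```

on this excerpt the result shows a NACK going to node 2 (the node that stayed up, which reported ring id (2.814)) and then an ACK going to node 1 (the rebooted node, which connected with ring id (1.814)), all within the same second.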