[ClusterLabs] temporary loss of quorum when member starts to rejoin

Sherrard Burton sb-clusterlabs at allafrica.com
Mon Apr 6 10:05:15 EDT 2020


...or at least that's what i think is happening :-)

two-node cluster, plus a quorum-only node. testing the behavior when the
active node is gracefully rebooted. all seems well initially: resources
are migrated, come up, and function as expected.

but, when the rebooted node starts to come back up, the other node seems 
to lose quorum temporarily, even though it still has communication with 
the quorum node. this causes the resources to stop until quorum is 
reestablished.

summary:
active node: xen-nfs01 192.168.250.50
standby node: xen-nfs02 192.168.250.51
quorum node: xen-quorum 192.168.250.52

issue reboot on xen-nfs01
xen-nfs02 becomes active node

xen-nfs01 starts to come back online
xen-nfs02 detects loss of quorum and stops resources
xen-nfs01 finishes booting
quorum is reestablished


instead of inundating you with all of the debugging output from
corosync, pacemaker and corosync-qnetd on all nodes, i'll start with the
basics, and provide whatever else is needed on request.

TIA


from the node that was not rebooted:
Apr  5 23:10:15 xen-nfs02 corosync[19099]:   [KNET  ] udp: Received ICMP 
error from 192.168.250.51: No route to host
Apr  5 23:10:15 xen-nfs02 corosync[19099]:   [KNET  ] udp: Received ICMP 
error from 192.168.250.51: No route to host
Apr  5 23:10:16 xen-nfs02 corosync[19099]:   [KNET  ] udp: Received ICMP 
error from 192.168.250.50: Connection refused
Apr  5 23:10:16 xen-nfs02 corosync[19099]:   [KNET  ] udp: Received ICMP 
error from 192.168.250.50: Connection refused
Apr  5 23:10:16 xen-nfs02 corosync[19099]:   [KNET  ] rx: host: 1 link: 
0 received pong: 1
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Received vote info
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   seq = 6
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   vote = NACK
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   ring id = (2.814)
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm result vote 
is NACK
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Cast vote timer 
remains scheduled every 500ms voting NACK.
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] flags: 
quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes 
QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] got nodeinfo 
message from cluster node 2
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] nodeinfo 
message[2]: votes: 1, expected: 3 flags: 49
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] flags: 
quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes 
QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] 
total_votes=2, expected_votes=3
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] node 1 
state=2, votes=1, expected=3
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] node 2 
state=1, votes=1, expected=3
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] quorum lost, 
blocking activity
Apr 05 23:10:17 [19099] xen-nfs02 corosync notice  [QUORUM] This node is 
within the non-primary component and will NOT provide any services.
Apr 05 23:10:17 [19099] xen-nfs02 corosync notice  [QUORUM] Members[1]: 2
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [QUORUM] sending 
quorum notification to (nil), length = 52
Apr 05 23:10:17 [19099] xen-nfs02 corosync debug   [VOTEQ ] Sending 
quorum callback, quorate = 0
...
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Votequorum quorum 
notify callback:
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   Quorate = 0
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   Node list (size = 3):
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:     0 nodeid = 1, 
state = 2
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:     1 nodeid = 2, 
state = 1
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:     2 nodeid = 0, 
state = 0
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm decided to 
send list and result vote is No change
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Sending quorum node 
list seq = 13, quorate = 0
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:   Node list:
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:     0 node_id = 1, 
data_center_id = 0, node_state = dead
Apr  5 23:10:17 xen-nfs02 corosync-qdevice[19108]:     1 node_id = 2, 
data_center_id = 0, node_state = member
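
for reference, a rough sketch of the majority arithmetic that seems to be in play in the log above. this is not corosync code: the function names are made up for illustration, and it assumes the usual simple-majority rule (more than half of expected votes) with the qdevice vote only counting while qdevice is casting it. it does not try to reproduce how votequorum arrives at the total_votes=2 value it logs.

```python
# hypothetical helpers for illustration only, not part of corosync


def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for a simple majority (more than half of expected)."""
    return expected_votes // 2 + 1


def is_quorate(node_votes: int, qdevice_votes: int,
               qdevice_casts_vote: bool, expected_votes: int) -> bool:
    """True if the live node votes, plus the qdevice vote when cast, reach a majority."""
    total = node_votes + (qdevice_votes if qdevice_casts_vote else 0)
    return total >= quorum_threshold(expected_votes)


# values from the xen-nfs02 log: expected=3, one live member (node 2) with votes=1.
# the qdevice vote count of 1 is an assumption (expected 3 minus two node votes).
print(is_quorate(1, 1, False, 3))  # qdevice voting NACK (QdeviceCastVote: No)
print(is_quorate(1, 1, True, 3))   # qdevice voting ACK
```

under those assumptions the first call prints False and the second prints True, which would match xen-nfs02 dropping out of quorum as soon as it receives the NACK.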



from the quorum node:
Apr 05 23:10:17 debug   New client connected
Apr 05 23:10:17 debug     cluster name = xen-nfs01_xen-nfs02
Apr 05 23:10:17 debug     tls started = 1
Apr 05 23:10:17 debug     tls peer certificate verified = 1
Apr 05 23:10:17 debug     node_id = 1
Apr 05 23:10:17 debug     pointer = 0x55b37c2d74f0
Apr 05 23:10:17 debug     addr_str = ::ffff:192.168.250.50:54462
Apr 05 23:10:17 debug     ring id = (1.814)
Apr 05 23:10:17 debug     cluster dump:
Apr 05 23:10:17 debug       client = ::ffff:192.168.250.51:54876, 
node_id = 2
Apr 05 23:10:17 debug       client = ::ffff:192.168.250.50:54462, 
node_id = 1
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.50:54462 (cluster 
xen-nfs01_xen-nfs02, node_id 1) sent initial node list.
Apr 05 23:10:17 debug     msg seq num = 4
Apr 05 23:10:17 debug     node list:
Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state 
= not set
Apr 05 23:10:17 debug       node_id = 2, data_center_id = 0, node_state 
= not set
Apr 05 23:10:17 debug   Algorithm result vote is Ask later
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.50:54462 (cluster 
xen-nfs01_xen-nfs02, node_id 1) sent membership node list.
Apr 05 23:10:17 debug     msg seq num = 5
Apr 05 23:10:17 debug     ring id = (1.814)
Apr 05 23:10:17 debug     heuristics = Undefined
Apr 05 23:10:17 debug     node list:
Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state 
= not set
Apr 05 23:10:17 debug   ffsplit: Membership for cluster 
xen-nfs01_xen-nfs02 is now stable
Apr 05 23:10:17 debug   ffsplit: Quorate partition selected
Apr 05 23:10:17 debug     node list:
Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state 
= not set
Apr 05 23:10:17 debug   Sending vote info to client 
::ffff:192.168.250.51:54876 (cluster xen-nfs01_xen-nfs02, node_id 2)
Apr 05 23:10:17 debug     msg seq num = 6
Apr 05 23:10:17 debug     vote = NACK
Apr 05 23:10:17 debug   Algorithm result vote is No change
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.50:54462 (cluster 
xen-nfs01_xen-nfs02, node_id 1) sent quorum node list.
Apr 05 23:10:17 debug     msg seq num = 6
Apr 05 23:10:17 debug     quorate = 0
Apr 05 23:10:17 debug     node list:
Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state 
= member
Apr 05 23:10:17 debug   Algorithm result vote is No change
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.51:54876 (cluster 
xen-nfs01_xen-nfs02, node_id 2) replied back to vote info message
Apr 05 23:10:17 debug     msg seq num = 6
Apr 05 23:10:17 debug   ffsplit: All NACK votes sent for cluster 
xen-nfs01_xen-nfs02
Apr 05 23:10:17 debug   Sending vote info to client 
::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1)
Apr 05 23:10:17 debug     msg seq num = 1
Apr 05 23:10:17 debug     vote = ACK
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.50:54462 (cluster 
xen-nfs01_xen-nfs02, node_id 1) replied back to vote info message
Apr 05 23:10:17 debug     msg seq num = 1
Apr 05 23:10:17 debug   ffsplit: All ACK votes sent for cluster 
xen-nfs01_xen-nfs02
Apr 05 23:10:17 debug   Client ::ffff:192.168.250.51:54876 (cluster 
xen-nfs01_xen-nfs02, node_id 2) sent quorum node list.
Apr 05 23:10:17 debug     msg seq num = 13
Apr 05 23:10:17 debug     quorate = 0
Apr 05 23:10:17 debug     node list:
Apr 05 23:10:17 debug       node_id = 1, data_center_id = 0, node_state 
= dead
Apr 05 23:10:17 debug       node_id = 2, data_center_id = 0, node_state 
= member
Apr 05 23:10:17 debug   Algorithm result vote is No change
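
to make the order of votes in the qnetd output above easier to see, here is a small sketch that pulls out which vote was sent to which node. it is only a log-scraping helper, not anything shipped with corosync-qnetd: the function names are made up, and it assumes every log entry starts with a timestamp like "Apr 05 23:10:17" and that any other line is a mail-wrapped continuation of the previous entry.

```python
import re

# assumption: each real log entry starts with "Mon DD HH:MM:SS "
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} ")
SEND = re.compile(r"Sending vote info to client \S+ \(cluster \S+, node_id (\d+)\)")
VOTE = re.compile(r"\bvote = (\w+)")


def unwrap(text):
    """Join mail-wrapped continuation lines back onto their log entry."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if TIMESTAMP.match(line) or not entries:
            entries.append(line.rstrip())
        else:
            entries[-1] += " " + line.strip()
    return entries


def votes_sent(text):
    """Return (node_id, vote) pairs in the order qnetd sent them."""
    result = []
    pending = None  # node_id of the last "Sending vote info" entry seen
    for entry in unwrap(text):
        sent = SEND.search(entry)
        if sent:
            pending = int(sent.group(1))
            continue
        vote = VOTE.search(entry)
        if vote and pending is not None:
            result.append((pending, vote.group(1)))
            pending = None
    return result


if __name__ == "__main__":
    import sys
    # usage: python this_script.py < qnetd.log
    print(votes_sent(sys.stdin.read()))
```

run against the quorum-node excerpt above, this should list the NACK to node_id 2 before the ACK to node_id 1.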


