[ClusterLabs] temporary loss of quorum when member starts to rejoin
Andrei Borzenkov
arvidjaar at gmail.com
Mon Apr 6 12:35:48 EDT 2020
On 06.04.2020 17:05, Sherrard Burton wrote:
> ...or at least that's what I think is happening :-)
>
> two-node cluster, plus quorum-only node. testing the behavior when
> active node is gracefully rebooted. all seems well initially. resources
> are migrated, come up and function as expected.
>
> but, when the rebooted node starts to come back up, the other node seems
> to lose quorum temporarily, even though it still has communication with
> the quorum node. this causes the resources to stop until quorum is
> reestablished.
>
> summary:
> active node: xen-nfs01 192.168.250.50
> standby node: xen-nfs02 192.168.250.51
> quorum node: xen-quorum 192.168.250.52
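For reference, the nodelist I would expect for those addresses looks roughly
like the following -- only a sketch with assumed key names, which is why I ask
for the real corosync.conf further down:

    nodelist {
        node {
            name: xen-nfs01
            nodeid: 1
            ring0_addr: 192.168.250.50
        }
        node {
            name: xen-nfs02
            nodeid: 2
            ring0_addr: 192.168.250.51
        }
    }
    # xen-quorum (192.168.250.52) is not a corosync member; it only runs
    # corosync-qnetd and is referenced from quorum { device { net { host } } }
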
>
> issue reboot on xen-nfs01
> xen-nfs02 becomes active node
>
> xen-nfs01 starts to come back online
> xen-nfs02 detects loss of quorum and stops resources
> xen-nfs01 finishes booting
> quorum is reestablished
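When you repeat the test, it would help to watch quorum on the surviving node
while xen-nfs01 reboots, for example (plain commands, nothing here is specific
to your setup):

    # on xen-nfs02: membership, votes and the quorate flag, refreshed every second
    watch -n1 corosync-quorumtool -s

    # on xen-nfs02: one-shot pacemaker status, to see when resources actually stop
    crm_mon -1

That makes it easy to correlate the moment quorum is reported lost with the
log lines below.
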
>
>
> instead of inundating you with all of the debugging output from
> corosync, pacemaker and corosync-qnetd on all nodes, I'll start with the
> basics and provide whatever else is needed on request.
>
Well, to sensibly interpret the logs, at the very least the IP address of
each node and the corosync configuration are needed.
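Concretely, something like this from both cluster nodes would be enough to
start with (standard tools, nothing assumed beyond a default install):

    # static configuration
    cat /etc/corosync/corosync.conf

    # runtime view of the quorum/qdevice settings corosync is actually using
    corosync-cmapctl | grep quorum

    # current membership and vote summary
    corosync-quorumtool -s
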
> TIA
>
>
> from the node that was not rebooted:
> Apr 5 23:10:15 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
> error from 192.168.250.51: No route to host
> Apr 5 23:10:15 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
> error from 192.168.250.51: No route to host
> Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
> error from 192.168.250.50: Connection refused
> Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
> error from 192.168.250.50: Connection refused
> Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] rx: host: 1 link:
> 0 received pong: 1
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Received vote info
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: seq = 6
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: vote = NACK
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: ring id = (2.814)
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm result vote
> is NACK
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Cast vote timer
> remains scheduled every 500ms voting NACK.
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] flags:
> quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes
> QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] got nodeinfo
> message from cluster node 2
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] nodeinfo
> message[2]: votes: 1, expected: 3 flags: 49
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] flags:
> quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes
> QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ]
> total_votes=2, expected_votes=3
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] node 1
> state=2, votes=1, expected=3
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] node 2
> state=1, votes=1, expected=3
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] quorum lost,
> blocking activity
qdevice decided not to cast a vote for the nfs02 node.
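If you can catch that window again, the qdevice state on xen-nfs02 and the
qnetd view on xen-quorum at that moment would be interesting; roughly:

    # on xen-nfs02: qdevice connection state and the vote it is currently given
    corosync-qdevice-tool -sv

    # on xen-quorum: connected clients and their ring ids / votes
    corosync-qnetd-tool -lv
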
> Apr 05 23:10:17 [19099] xen-nfs02 corosync notice [QUORUM] This node is
> within the non-primary component and will NOT provide any services.
> Apr 05 23:10:17 [19099] xen-nfs02 corosync notice [QUORUM] Members[1]: 2
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [QUORUM] sending
> quorum notification to (nil), length = 52
> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] Sending
> quorum callback, quorate = 0
> ...
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Votequorum quorum
> notify callback:
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Quorate = 0
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Node list (size = 3):
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 0 nodeid = 1,
> state = 2
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 1 nodeid = 2,
> state = 1
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 2 nodeid = 0,
> state = 0
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm decided to
> send list and result vote is No change
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Sending quorum node
> list seq = 13, quorate = 0
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Node list:
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 0 node_id = 1,
> data_center_id = 0, node_state = dead
> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 1 node_id = 2,
> data_center_id = 0, node_state = member
>
>
>
> from the quorum node:
> Apr 05 23:10:17 debug New client connected
> Apr 05 23:10:17 debug cluster name = xen-nfs01_xen-nfs02
> Apr 05 23:10:17 debug tls started = 1
> Apr 05 23:10:17 debug tls peer certificate verified = 1
> Apr 05 23:10:17 debug node_id = 1
> Apr 05 23:10:17 debug pointer = 0x55b37c2d74f0
> Apr 05 23:10:17 debug addr_str = ::ffff:192.168.250.50:54462
> Apr 05 23:10:17 debug ring id = (1.814)
> Apr 05 23:10:17 debug cluster dump:
> Apr 05 23:10:17 debug client = ::ffff:192.168.250.51:54876,
> node_id = 2
> Apr 05 23:10:17 debug client = ::ffff:192.168.250.50:54462,
> node_id = 1
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
> xen-nfs01_xen-nfs02, node_id 1) sent initial node list.
> Apr 05 23:10:17 debug msg seq num = 4
> Apr 05 23:10:17 debug node list:
> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
> = not set
> Apr 05 23:10:17 debug node_id = 2, data_center_id = 0, node_state
> = not set
> Apr 05 23:10:17 debug Algorithm result vote is Ask later
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
> xen-nfs01_xen-nfs02, node_id 1) sent membership node list.
> Apr 05 23:10:17 debug msg seq num = 5
> Apr 05 23:10:17 debug ring id = (1.814)
> Apr 05 23:10:17 debug heuristics = Undefined
> Apr 05 23:10:17 debug node list:
> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
> = not set
> Apr 05 23:10:17 debug ffsplit: Membership for cluster
> xen-nfs01_xen-nfs02 is now stable
> Apr 05 23:10:17 debug ffsplit: Quorate partition selected
> Apr 05 23:10:17 debug node list:
> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
> = not set
> Apr 05 23:10:17 debug Sending vote info to client
> ::ffff:192.168.250.51:54876 (cluster xen-nfs01_xen-nfs02, node_id 2)
> Apr 05 23:10:17 debug msg seq num = 6
> Apr 05 23:10:17 debug vote = NACK
> Apr 05 23:10:17 debug Algorithm result vote is No change
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
> xen-nfs01_xen-nfs02, node_id 1) sent quorum node list.
> Apr 05 23:10:17 debug msg seq num = 6
> Apr 05 23:10:17 debug quorate = 0
> Apr 05 23:10:17 debug node list:
> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
> = member
Oops. How come the node that was rebooted formed a cluster all by
itself, without seeing the second node? Do you have two_node and/or
wait_for_all configured?
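For comparison, with a qnetd host at 192.168.250.52 and the ffsplit algorithm
your logs show, I would expect a quorum section roughly like this (a sketch
only; wait_for_all shown because of the question above, everything else
assumed):

    quorum {
        provider: corosync_votequorum
        # two nodes plus one vote from the quorum device
        expected_votes: 3
        # keeps a freshly booted node from becoming quorate before it
        # has seen the other node at least once
        wait_for_all: 1
        device {
            votes: 1
            model: net
            net {
                host: 192.168.250.52
                algorithm: ffsplit
                tls: on
            }
        }
    }
    # note: as far as I know two_node cannot be combined with a quorum
    # device, so it should not appear here at all

If your real quorum section differs from that, the difference is the first
place I would look.
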
> Apr 05 23:10:17 debug Algorithm result vote is No change
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.51:54876 (cluster
> xen-nfs01_xen-nfs02, node_id 2) replied back to vote info message
> Apr 05 23:10:17 debug msg seq num = 6
> Apr 05 23:10:17 debug ffsplit: All NACK votes sent for cluster
> xen-nfs01_xen-nfs02
> Apr 05 23:10:17 debug Sending vote info to client
> ::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1)
> Apr 05 23:10:17 debug msg seq num = 1
> Apr 05 23:10:17 debug vote = ACK
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
> xen-nfs01_xen-nfs02, node_id 1) replied back to vote info message
> Apr 05 23:10:17 debug msg seq num = 1
> Apr 05 23:10:17 debug ffsplit: All ACK votes sent for cluster
> xen-nfs01_xen-nfs02
> Apr 05 23:10:17 debug Client ::ffff:192.168.250.51:54876 (cluster
> xen-nfs01_xen-nfs02, node_id 2) sent quorum node list.
> Apr 05 23:10:17 debug msg seq num = 13
> Apr 05 23:10:17 debug quorate = 0
> Apr 05 23:10:17 debug node list:
> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
> = dead
> Apr 05 23:10:17 debug node_id = 2, data_center_id = 0, node_state
> = member
> Apr 05 23:10:17 debug Algorithm result vote is No change
>