[ClusterLabs] temporary loss of quorum when member starts to rejoin
Sherrard Burton
sb-clusterlabs at allafrica.com
Mon Apr 6 13:09:21 EDT 2020
On 4/6/20 12:35 PM, Andrei Borzenkov wrote:
> 06.04.2020 17:05, Sherrard Burton wrote:
>> ...or at least that's what i think is happening :-)
>>
>> two-node cluster, plus a quorum-only node. testing the behavior when
>> the active node is gracefully rebooted. all seems well initially.
>> resources are migrated, come up, and function as expected.
>>
>> but, when the rebooted node starts to come back up, the other node seems
>> to lose quorum temporarily, even though it still has communication with
>> the quorum node. this causes the resources to stop until quorum is
>> reestablished.
>>
>> summary:
>> active node: xen-nfs01 192.168.250.50
>> standby node: xen-nfs02 192.168.250.51
>> quorum node: xen-quorum 192.168.250.52
>>
>> issue reboot on xen-nfs01
>> xen-nfs02 becomes active node
>>
>> xen-nfs01 starts to come back online
>> xen-nfs02 detects loss of quorum and stops resources
>> xen-nfs01 finishes booting
>> quorum is reestablished
>>
>>
>> instead of inundating you with all of the debugging output from
>> corosync, pacemaker and corosync-qnetd on all nodes, i'll start with
>> the basics, and provide whatever else is needed on request.
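>>
>> if it helps, i can also sample quorum state directly on all three
>> machines while reproducing this (standard corosync tooling):
>>
>> root@xen-nfs02:~# corosync-quorumtool -s
>> root@xen-nfs02:~# corosync-qdevice-tool -sv
>> root@xen-quorum:~# corosync-qnetd-tool -lv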
>>
>
> Well, to sensibly interpret the logs, the IP address of each node and
> the corosync configuration are needed at the very least.
node IPs provided above.
corosync conf:
root@xen-nfs01:~# grep -vF -e '#' /etc/corosync/corosync.conf | grep -vFx ''
totem {
    version: 2
    cluster_name: xen-nfs01_xen-nfs02
    crypto_cipher: aes256
    crypto_hash: sha512
}
logging {
    fileline: off
    to_stderr: yes
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    to_syslog: yes
    debug: on
    logger_subsys {
        subsys: QUORUM
        debug: on
    }
}
nodelist {
    node {
        name: xen-nfs01
        nodeid: 1
        ring0_addr: 192.168.250.50
    }
    node {
        name: xen-nfs02
        nodeid: 2
        ring0_addr: 192.168.250.51
    }
}
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        sync_timeout: 3000
        timeout: 1000
        net {
            tls: on
            host: xen-quorum
            algorithm: ffsplit
        }
    }
}
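for completeness, the device settings corosync actually loaded can be
cross-checked at runtime through the cmap keys (corosync-cmapctl ships
with corosync); this should echo the model/timeout values above:

root@xen-nfs01:~# corosync-cmapctl | grep -F 'quorum.device'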
>
>> TIA
>>
>>
>> from the node that was not rebooted:
>> Apr 5 23:10:15 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
>> error from 192.168.250.51: No route to host
>> Apr 5 23:10:15 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
>> error from 192.168.250.51: No route to host
>> Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
>> error from 192.168.250.50: Connection refused
>> Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] udp: Received ICMP
>> error from 192.168.250.50: Connection refused
>> Apr 5 23:10:16 xen-nfs02 corosync[19099]: [KNET ] rx: host: 1 link:
>> 0 received pong: 1
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Received vote info
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: seq = 6
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: vote = NACK
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: ring id = (2.814)
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm result vote
>> is NACK
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Cast vote timer
>> remains scheduled every 500ms voting NACK.
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] flags:
>> quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes
>> QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] got nodeinfo
>> message from cluster node 2
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] nodeinfo
>> message[2]: votes: 1, expected: 3 flags: 49
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] flags:
>> quorate: Yes Leaving: No WFA Status: No First: No Qdevice: Yes
>> QdeviceAlive: Yes QdeviceCastVote: No QdeviceMasterWins: No
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ]
>> total_votes=2, expected_votes=3
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] node 1
>> state=2, votes=1, expected=3
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] node 2
>> state=1, votes=1, expected=3
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] quorum lost,
>> blocking activity
>
> qdevice decided not to cast a vote for the nfs02 node.
>
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync notice [QUORUM] This node is
>> within the non-primary component and will NOT provide any services.
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync notice [QUORUM] Members[1]: 2
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [QUORUM] sending
>> quorum notification to (nil), length = 52
>> Apr 05 23:10:17 [19099] xen-nfs02 corosync debug [VOTEQ ] Sending
>> quorum callback, quorate = 0
>> ...
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Votequorum quorum
>> notify callback:
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Quorate = 0
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Node list (size = 3):
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 0 nodeid = 1,
>> state = 2
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 1 nodeid = 2,
>> state = 1
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 2 nodeid = 0,
>> state = 0
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Algorithm decided to
>> send list and result vote is No change
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Sending quorum node
>> list seq = 13, quorate = 0
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: Node list:
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 0 node_id = 1,
>> data_center_id = 0, node_state = dead
>> Apr 5 23:10:17 xen-nfs02 corosync-qdevice[19108]: 1 node_id = 2,
>> data_center_id = 0, node_state = member
>>
>>
>>
>> from the quorum node:
>> Apr 05 23:10:17 debug New client connected
>> Apr 05 23:10:17 debug cluster name = xen-nfs01_xen-nfs02
>> Apr 05 23:10:17 debug tls started = 1
>> Apr 05 23:10:17 debug tls peer certificate verified = 1
>> Apr 05 23:10:17 debug node_id = 1
>> Apr 05 23:10:17 debug pointer = 0x55b37c2d74f0
>> Apr 05 23:10:17 debug addr_str = ::ffff:192.168.250.50:54462
>> Apr 05 23:10:17 debug ring id = (1.814)
>> Apr 05 23:10:17 debug cluster dump:
>> Apr 05 23:10:17 debug client = ::ffff:192.168.250.51:54876,
>> node_id = 2
>> Apr 05 23:10:17 debug client = ::ffff:192.168.250.50:54462,
>> node_id = 1
>> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
>> xen-nfs01_xen-nfs02, node_id 1) sent initial node list.
>> Apr 05 23:10:17 debug msg seq num = 4
>> Apr 05 23:10:17 debug node list:
>> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
>> = not set
>> Apr 05 23:10:17 debug node_id = 2, data_center_id = 0, node_state
>> = not set
>> Apr 05 23:10:17 debug Algorithm result vote is Ask later
>> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
>> xen-nfs01_xen-nfs02, node_id 1) sent membership node list.
>> Apr 05 23:10:17 debug msg seq num = 5
>> Apr 05 23:10:17 debug ring id = (1.814)
>> Apr 05 23:10:17 debug heuristics = Undefined
>> Apr 05 23:10:17 debug node list:
>> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
>> = not set
>> Apr 05 23:10:17 debug ffsplit: Membership for cluster
>> xen-nfs01_xen-nfs02 is now stable
>> Apr 05 23:10:17 debug ffsplit: Quorate partition selected
>> Apr 05 23:10:17 debug node list:
>> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
>> = not set
>> Apr 05 23:10:17 debug Sending vote info to client
>> ::ffff:192.168.250.51:54876 (cluster xen-nfs01_xen-nfs02, node_id 2)
>> Apr 05 23:10:17 debug msg seq num = 6
>> Apr 05 23:10:17 debug vote = NACK
>> Apr 05 23:10:17 debug Algorithm result vote is No change
>> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
>> xen-nfs01_xen-nfs02, node_id 1) sent quorum node list.
>> Apr 05 23:10:17 debug msg seq num = 6
>> Apr 05 23:10:17 debug quorate = 0
>> Apr 05 23:10:17 debug node list:
>> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
>> = member
>
> Oops. How come the node that was rebooted formed a cluster all by
> itself, without seeing the second node? Do you have two_node and/or
> wait_for_all configured?
neither. i removed two_node when i added the quorum node. i was not
previously familiar with wait_for_all.
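if wait_for_all would keep the rejoining node from briefly forming its
own partition, i am happy to re-test with it enabled. untested sketch,
quorum section only, everything else unchanged:

quorum {
    provider: corosync_votequorum
    wait_for_all: 1
    device {
        ...
    }
}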
>
>> Apr 05 23:10:17 debug Algorithm result vote is No change
>> Apr 05 23:10:17 debug Client ::ffff:192.168.250.51:54876 (cluster
>> xen-nfs01_xen-nfs02, node_id 2) replied back to vote info message
>> Apr 05 23:10:17 debug msg seq num = 6
>> Apr 05 23:10:17 debug ffsplit: All NACK votes sent for cluster
>> xen-nfs01_xen-nfs02
>> Apr 05 23:10:17 debug Sending vote info to client
>> ::ffff:192.168.250.50:54462 (cluster xen-nfs01_xen-nfs02, node_id 1)
>> Apr 05 23:10:17 debug msg seq num = 1
>> Apr 05 23:10:17 debug vote = ACK
>> Apr 05 23:10:17 debug Client ::ffff:192.168.250.50:54462 (cluster
>> xen-nfs01_xen-nfs02, node_id 1) replied back to vote info message
>> Apr 05 23:10:17 debug msg seq num = 1
>> Apr 05 23:10:17 debug ffsplit: All ACK votes sent for cluster
>> xen-nfs01_xen-nfs02
>> Apr 05 23:10:17 debug Client ::ffff:192.168.250.51:54876 (cluster
>> xen-nfs01_xen-nfs02, node_id 2) sent quorum node list.
>> Apr 05 23:10:17 debug msg seq num = 13
>> Apr 05 23:10:17 debug quorate = 0
>> Apr 05 23:10:17 debug node list:
>> Apr 05 23:10:17 debug node_id = 1, data_center_id = 0, node_state
>> = dead
>> Apr 05 23:10:17 debug node_id = 2, data_center_id = 0, node_state
>> = member
>> Apr 05 23:10:17 debug Algorithm result vote is No change
>>