[ClusterLabs] Re: Strange lost quorum with qdevice
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon Aug 12 01:46:01 EDT 2019
At a quick glance, this message seems to indicate the problem:
crmd[1214]: warning: Input I_ELECTION_DC received in state S_INTEGRATION from
...
To me it looks as if more than one node had a communication problem.
How did you simulate node failure?
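The method matters for the timing you see: a hard VM power-off and a network
partition take different paths through corosync and qnetd. For illustration
only (hypothetical commands, assuming a libvirt test bed; the node name and IP
are taken from the logs below):

    # hard power-off of the VM, matching the "powered off" described below
    virsh destroy tuchanka1a
    # versus a network cut applied on the surviving node
    # (192.168.89.11 is tuchanka1a in the qnetd logs below)
    iptables -A INPUT -s 192.168.89.11 -j DROP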
>>> ???? ???????? <splarv at ya.ru> wrote on 08.08.2019 at 17:41 in message
<4BC322B7-FE8C-48AB-BE20-94DEF65CF45C at ya.ru>:
> Hello all.
>
> I have a test bed with several virtual machines to test Pacemaker. I
> simulate random failures on one of the nodes. The cluster will span several
> data centres, so there is no stonith device; instead I use qnetd in the
> third data centre and a watchdog (softdog). And sometimes (not always), on
> failure of one node, the second node is also rebooted by its watchdog due to
> loss of quorum. I doubled the quorum timeouts:
>
> for qnetd: COROSYNC_QNETD_OPTIONS="-S dpd_interval=20000 -d"
> for pacemaker: pcs quorum device add sync_timeout=60000 timeout=20000 model
> net host='witness' algorithm=ffsplit
>
> I also set the -I 60 option (net timeout) for sbd.
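>
> For reference, the qnetd option above goes into /etc/sysconfig/corosync-qnetd,
> and the pcs command should end up in /etc/corosync/corosync.conf roughly as
> follows (a sketch of the generated section, not copied from my machines):
>
>     quorum {
>         provider: corosync_votequorum
>         device {
>             model: net
>             sync_timeout: 60000   # ms, poll timeout while membership changes
>             timeout: 20000        # ms, regular qdevice poll timeout
>             net {
>                 host: witness
>                 algorithm: ffsplit
>             }
>         }
>     }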
>
> But the effect still exists:
>
> Logs, after one of the nodes «tuchanka1a» was powered off at 17:13:53.
>
> From server 'witness' with qnetd:
>
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug Client
> ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) sent membership node
> list.
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug msg seq
> num = 7
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug ring id =
> (2.4c)
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug heuristics
> = Undefined
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug node list:
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug node_id
> = 2, data_center_id = 0, node_state = not set
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug ffsplit:
> Membership for cluster krogan1 is not yet stable
> Aug 8 17:13:55 witness corosync-qnetd: Aug 08 17:13:55 debug Algorithm
> result vote is Wait for reply
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug Client
> ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) sent quorum node
> list.
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug msg seq
> num = 8
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug quorate =
> 0
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node list:
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node_id
> = 1, data_center_id = 0, node_state = dead
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug node_id
> = 2, data_center_id = 0, node_state = member
> Aug 8 17:14:55 witness corosync-qnetd: Aug 08 17:14:55 debug Algorithm
> result vote is No change
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 warning Client
> ::ffff:192.168.89.11:47456 doesn't sent any message during 40000ms.
> Disconnecting
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Client
> ::ffff:192.168.89.11:47456 (init_received 1, cluster krogan1, node_id 1)
> disconnect
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit:
> Membership for cluster krogan1 is now stable
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit:
> Quorate partition selected
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug node list:
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug node_id
> = 2, data_center_id = 0, node_state = not set
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: No
> client gets NACK
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Sending vote
> info to client ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2)
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug msg seq
> num = 2
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug vote = ACK
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug Client
> ::ffff:192.168.89.12:39144 (cluster krogan1, node_id 2) replied back to vote
> info message
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug msg seq
> num = 2
> Aug 8 17:15:00 witness corosync-qnetd: Aug 08 17:15:00 debug ffsplit: All
> ACK votes sent for cluster krogan1
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 warning Client
> ::ffff:192.168.89.12:39144 doesn't sent any message during 40000ms.
> Disconnecting
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug Client
> ::ffff:192.168.89.12:39144 (init_received 1, cluster krogan1, node_id 2)
> disconnect
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit:
> Membership for cluster krogan1 is now stable
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No
> quorate partition was selected
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No
> client gets NACK
> Aug 8 17:17:00 witness corosync-qnetd: Aug 08 17:17:00 debug ffsplit: No
> client gets ACK
>
> From the other node 'tuchanka1b'. I deleted all sbd lines reading "info:
> notify_parent: Notifying parent: healthy" and "info: notify_parent: Not
> notifying parent: state transient (2)".
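>
> (Filtered along these lines; a sketch, not the exact command used:
>
>     grep -vE 'notify_parent: (Notifying parent: healthy|Not notifying parent: state transient)' /var/log/messages
>
> a few such lines near the interesting transitions are left in below.)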
>
> Aug 8 17:13:54 tuchanka1b corosync[1185]: [TOTEM ] A processor failed,
> forming new configuration.
> Aug 8 17:13:55 tuchanka1b corosync[1185]: [TOTEM ] A new membership
> (192.168.89.12:76) was formed. Members left: 1
> Aug 8 17:13:55 tuchanka1b corosync[1185]: [TOTEM ] Failed to receive the
> leave message. failed: 1
> Aug 8 17:13:55 tuchanka1b corosync[1185]: [VOTEQ ] waiting for quorum
> device Qdevice poll (but maximum for 60000 ms)
> Aug 8 17:13:55 tuchanka1b stonith-ng[1210]: notice: Node tuchanka1a state
> is now lost
> Aug 8 17:13:55 tuchanka1b stonith-ng[1210]: notice: Purged 1 peer with
> id=1 and/or uname=tuchanka1a from the membership cache
> Aug 8 17:13:55 tuchanka1b crmd[1214]: notice: Our peer on the DC
> (tuchanka1a) is dead
> Aug 8 17:13:55 tuchanka1b crmd[1214]: notice: State transition S_NOT_DC ->
> S_ELECTION
> Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Node tuchanka1a state is
> now lost
> Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Removing all tuchanka1a
> attributes for peer loss
> Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Lost attribute writer
> tuchanka1a
> Aug 8 17:13:55 tuchanka1b attrd[1212]: notice: Purged 1 peer with id=1
> and/or uname=tuchanka1a from the membership cache
> Aug 8 17:13:55 tuchanka1b cib[1209]: notice: Node tuchanka1a state is now
> lost
> Aug 8 17:13:55 tuchanka1b cib[1209]: notice: Purged 1 peer with id=1
> and/or uname=tuchanka1a from the membership cache
> Aug 8 17:14:25 tuchanka1b crmd[1214]: notice: Deletion of
> "//node_state[@uname='tuchanka1a']/transient_attributes": Timer expired
> (rc=-62)
> Aug 8 17:14:55 tuchanka1b corosync[1185]: [VOTEQ ] lost contact with quorum
> device Qdevice
> Aug 8 17:14:55 tuchanka1b corosync[1185]: [VOTEQ ] Waiting for all cluster
> members. Current votes: 1 expected_votes: 3
> Aug 8 17:14:55 tuchanka1b corosync[1185]: [QUORUM] This node is within the
> non-primary component and will NOT provide any services.
> Aug 8 17:14:55 tuchanka1b corosync[1185]: [QUORUM] Members[1]: 2
> Aug 8 17:14:55 tuchanka1b corosync[1185]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Aug 8 17:14:55 tuchanka1b pacemakerd[1208]: warning: Quorum lost
> Aug 8 17:14:55 tuchanka1b pacemakerd[1208]: notice: Node tuchanka1a state
> is now lost
> Aug 8 17:14:55 tuchanka1b crmd[1214]: warning: Quorum lost
> Aug 8 17:14:55 tuchanka1b crmd[1214]: notice: Node tuchanka1a state is now
> lost
> Aug 8 17:14:55 tuchanka1b sbd[1182]: cluster: info: notify_parent:
> Notifying parent: healthy
> Aug 8 17:14:55 tuchanka1b crmd[1214]: notice: State transition S_ELECTION
> -> S_INTEGRATION
> Aug 8 17:14:55 tuchanka1b crmd[1214]: warning: Input I_ELECTION_DC received
> in state S_INTEGRATION from do_election_check
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: Watchdog will be used via
> SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Fencing and resource
> management disabled due to lack of quorum
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Cluster node tuchanka1a
> is unclean: peer is no longer part of the cluster
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Node tuchanka1a is
> unclean
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action
> krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0
> on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action
> krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0
> on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action
> krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0
> on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action
> krogan1DB:1_demote_0 on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1DB:1_stop_0
> on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Action krogan1IP_stop_0
> on tuchanka1a is unrunnable (offline)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Node tuchanka1a is
> unclean!
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: Cannot fence unclean
> nodes until quorum is attained (or no-quorum-policy is set to ignore)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1DB:0
> ( Slave tuchanka1b ) due to no quorum
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1DB:1
> ( Master tuchanka1a ) due to node availability (blocked)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1IP
> ( tuchanka1a ) due to no quorum (blocked)
> Aug 8 17:14:56 tuchanka1b pengine[1213]: notice: * Stop krogan1sIP
> ( tuchanka1b ) due to no quorum
> Aug 8 17:14:56 tuchanka1b pengine[1213]: warning: Calculated transition 0
> (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-4.bz2
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating cancel operation
> krogan1DB_monitor_15000 locally on tuchanka1b
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating cancel operation
> krogan1DB_monitor_17000 locally on tuchanka1b
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating notify operation
> krogan1DB_pre_notify_stop_0 locally on tuchanka1b
> Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used
> via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating stop operation
> krogan1sIP_stop_0 locally on tuchanka1b
> Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used
> via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used
> via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b IPaddr2(krogan1sIP)[5245]: INFO: IP status = ok,
> IP_CIP=
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Result of stop operation for
> krogan1sIP on tuchanka1b: 0 (ok)
> Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used
> via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Result of notify operation
> for krogan1DB on tuchanka1b: 0 (ok)
> Aug 8 17:14:56 tuchanka1b crmd[1214]: notice: Initiating stop operation
> krogan1DB_stop_0 locally on tuchanka1b
> Aug 8 17:14:56 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used
> via SBD if fencing is required
> Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: cluster_status:
> Fencing and resource management disabled due to lack of quorum
> Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning: pe_fence_node:
> Cluster node tuchanka1a is unclean: peer is no longer part of the cluster
> Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: warning:
> determine_online_status: Node tuchanka1a is unclean
> Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: info:
> set_servant_health: Quorum lost: Stop ALL resources
> Aug 8 17:14:56 tuchanka1b sbd[1181]: pcmk: info: notify_parent:
> Not notifying parent: state transient (2)
> Aug 8 17:14:57 tuchanka1b pgsqlms(krogan1DB)[5312]: INFO: Instance
> "krogan1DB" stopped
> Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: Result of stop operation for
> krogan1DB on tuchanka1b: 0 (ok)
> Aug 8 17:14:57 tuchanka1b stonith-ng[1210]: notice: Watchdog will be used
> via SBD if fencing is required
> Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: Transition 0 (Complete=11,
> Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-4.bz2): Complete
> Aug 8 17:14:57 tuchanka1b crmd[1214]: notice: State transition
> S_TRANSITION_ENGINE -> S_IDLE
> Aug 8 17:14:57 tuchanka1b sbd[1181]: pcmk:
> info: notify_parent: Not notifying parent: state transient (2)
> Aug 8 17:14:58 tuchanka1b ntpd[557]: Deleting interface #8 eth0,
> 192.168.89.104#123, interface stats: received=0, sent=0, dropped=0,
> active_time=260 secs
> from now on, repeating: Aug 8 17:14:58 tuchanka1b sbd[1181]:
> pcmk: info: notify_parent: Not notifying parent: state transient (2)
> from now on, repeating: Aug 8 17:14:59 tuchanka1b sbd[1181]:
> pcmk: warning: cluster_status: Fencing and resource management disabled due
> to lack of quorum
> from now on, repeating: Aug 8 17:14:59 tuchanka1b sbd[1181]:
> pcmk: warning: pe_fence_node: Cluster node tuchanka1a is unclean: peer is no
> longer part of the cluster
> from now on, repeating: Aug 8 17:14:59 tuchanka1b sbd[1181]:
> pcmk: warning: determine_online_status: Node tuchanka1a is unclean
> Aug 8 17:15:00 tuchanka1b corosync[1185]: [VOTEQ ] Waiting for all cluster
> members. Current votes: 2 expected_votes: 3
> Aug 8 17:15:56 tuchanka1b sbd[1180]: warning: inquisitor_child: Servant
> pcmk is outdated (age: 61)
> Aug 8 17:15:59 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No
> liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
> Aug 8 17:15:59 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No
> liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
> Aug 8 17:16:00 tuchanka1b sbd[1180]: warning: inquisitor_child: Latency: No
> liveness for 5 s exceeds threshold of 3 s (healthy servants: 0)
> Rebooted
>
> What is strange:
>
> 1. The node waits for quorum for up to 60 s, but the quorum vote from qnetd
> arrived 5 s after that. So at 17:14:55 the node lost quorum.
> 2. At 17:15:00 it got the vote, but still no quorum. Is that due to the
> "wait_for_all" option?
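>
> Putting the timestamps together (my reading of the logs above; the
> interpretation after each timestamp is an assumption, not quoted verbatim):
>
>     17:13:53  tuchanka1a powered off
>     17:13:55  corosync on tuchanka1b forms a new membership and starts
>               waiting for the Qdevice poll (maximum 60000 ms = sync_timeout)
>     17:14:55  sync_timeout expires: contact with the quorum device lost,
>               quorum lost, sbd's pcmk servant stops reporting healthy
>     17:15:00  qnetd's 40000 ms dead-peer timeout on tuchanka1a fires and the
>               ACK vote reaches tuchanka1b, 5 s too late; votes 2 of 3, but
>               "Waiting for all cluster members"
>     17:15:56  sbd: servant pcmk is outdated (age: 61)
>     17:16:00  watchdog reboot
>
> The "Waiting for all cluster members" lines make me suspect wait_for_all is
> indeed in effect (the ffsplit algorithm requires it, as far as I know), which
> would explain why the vote at 17:15:00 did not restore quorum.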