[ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went Down Anyway?
Andrei Borzenkov
arvidjaar at gmail.com
Mon Mar 1 05:43:49 EST 2021
On 01.03.2021 12:26, Jan Friesse wrote:
>>
>
> Thanks for digging into the logs. I believe Eric is hitting
> https://github.com/corosync/corosync-qdevice/issues/10 (already fixed,
> but may take some time to get into distributions) - it also contains a
> workaround.
>
I tested corosync-qnetd at df3c672, which should include these fixes.
It changed the behavior, but I still cannot explain it.
Again the setup is ha1 + ha2 + qnetd, with ha2 as the current DC. I
disconnect ha1 (blocking everything with ha1's source MAC); stonith is
disabled. corosync and corosync-qdevice on the nodes are still 2.4.5,
if that matters.
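For completeness, the relevant pieces of the setup look roughly like
this; the qnetd hostname, the MAC address and the ffsplit algorithm
below are placeholders for illustration, not copied from my actual
config:

corosync.conf, quorum section:

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            votes: 1
            net {
                host: qnetd.example.com
                algorithm: ffsplit
            }
        }
    }

and ha1 is isolated by dropping everything with its source MAC, for
example on the bridge:

    # drop all frames originating from ha1 (placeholder MAC)
    ebtables -A FORWARD -s 52:54:00:aa:bb:cc -j DROP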
ha2:
Mar 01 13:23:27 ha2 corosync[1576]: [TOTEM ] A processor failed,
forming new configuration.
Mar 01 13:23:28 ha2 corosync[1576]: [VOTEQ ] waiting for quorum device
Qdevice poll (but maximum for 30000 ms)
Mar 01 13:23:28 ha2 corosync[1576]: [TOTEM ] A new membership
(192.168.1.2:3632) was formed. Members left: 1
Mar 01 13:23:28 ha2 corosync[1576]: [TOTEM ] Failed to receive the
leave message. failed: 1
Mar 01 13:23:28 ha2 corosync[1576]: [CPG ] downlist left_list: 1
received
Mar 01 13:23:28 ha2 pacemaker-based[2032]: notice: Node ha1 state is
now lost
Mar 01 13:23:28 ha2 pacemaker-based[2032]: notice: Purged 1 peer with
id=1 and/or uname=ha1 from the membership cache
Mar 01 13:23:28 ha2 pacemaker-attrd[2035]: notice: Lost attribute
writer ha1
Mar 01 13:23:28 ha2 pacemaker-attrd[2035]: notice: Node ha1 state is
now lost
Mar 01 13:23:28 ha2 pacemaker-attrd[2035]: notice: Removing all ha1
attributes for peer loss
Mar 01 13:23:28 ha2 pacemaker-attrd[2035]: notice: Purged 1 peer with
id=1 and/or uname=ha1 from the membership cache
Mar 01 13:23:28 ha2 pacemaker-fenced[2033]: notice: Node ha1 state is
now lost
Mar 01 13:23:28 ha2 pacemaker-fenced[2033]: notice: Purged 1 peer with
id=1 and/or uname=ha1 from the membership cache
Mar 01 13:23:28 ha2 pacemaker-controld[2037]: warning: Stonith/shutdown
of node ha1 was not expected
Mar 01 13:23:28 ha2 pacemaker-controld[2037]: notice: State transition
S_IDLE -> S_POLICY_ENGINE
Mar 01 13:23:33 ha2 pacemaker-controld[2037]: notice: High CPU load
detected: 1.200000
Mar 01 13:23:35 ha2 corosync[1576]: [QUORUM] Members[1]: 2
Mar 01 13:23:35 ha2 corosync[1576]: [MAIN ] Completed service
synchronization, ready to provide service.
Mar 01 13:23:35 ha2 pacemaker-attrd[2035]: notice: Recorded local node
as attribute writer (was unset)
Mar 01 13:23:35 ha2 pacemaker-controld[2037]: notice: Node ha1 state is
now lost
Mar 01 13:23:35 ha2 pacemaker-controld[2037]: warning: Stonith/shutdown
of node ha1 was not expected
Mar 01 13:23:36 ha2 pacemaker-schedulerd[2036]: notice: * Promote
p_drbd0:0 ( Slave -> Master ha2 )
Mar 01 13:23:36 ha2 pacemaker-schedulerd[2036]: notice: * Start
p_fs_clust01 ( ha2 )
Mar 01 13:23:36 ha2 pacemaker-schedulerd[2036]: notice: * Start
p_mysql_001 ( ha2 )
So ha2 is pretty fast to react, about 8 seconds (13:23:27 "A processor
failed" to 13:23:35 "Members[1]: 2").
ha1:
Mar 01 13:23:27 ha1 corosync[1552]: [TOTEM ] A processor failed,
forming new configuration.
Mar 01 13:23:30 ha1 corosync[1552]: [VOTEQ ] waiting for quorum device
Qdevice poll (but maximum for 30000 ms)
Mar 01 13:23:30 ha1 corosync[1552]: [TOTEM ] A new membership
(192.168.1.1:3640) was formed. Members left: 2
Mar 01 13:23:30 ha1 corosync[1552]: [TOTEM ] Failed to receive the
leave message. failed: 2
Mar 01 13:23:30 ha1 corosync[1552]: [CPG ] downlist left_list: 1
received
Mar 01 13:23:30 ha1 pacemaker-attrd[1738]: notice: Node ha2 state is
now lost
Mar 01 13:23:30 ha1 pacemaker-attrd[1738]: notice: Removing all ha2
attributes for peer loss
Mar 01 13:23:30 ha1 pacemaker-attrd[1738]: notice: Purged 1 peer with
id=2 and/or uname=ha2 from the membership cache
Mar 01 13:23:30 ha1 pacemaker-based[1735]: notice: Node ha2 state is
now lost
Mar 01 13:23:30 ha1 pacemaker-based[1735]: notice: Purged 1 peer with
id=2 and/or uname=ha2 from the membership cache
Mar 01 13:23:30 ha1 pacemaker-controld[1740]: notice: Our peer on the
DC (ha2) is dead
Mar 01 13:23:30 ha1 pacemaker-controld[1740]: notice: State transition
S_NOT_DC -> S_ELECTION
Mar 01 13:23:30 ha1 pacemaker-fenced[1736]: notice: Node ha2 state is
now lost
Mar 01 13:23:30 ha1 pacemaker-fenced[1736]: notice: Purged 1 peer with
id=2 and/or uname=ha2 from the membership cache
Mar 01 13:23:32 ha1 corosync[1552]: [VOTEQ ] waiting for quorum device
Qdevice poll (but maximum for 30000 ms)
Mar 01 13:23:32 ha1 corosync[1552]: [TOTEM ] A new membership
(192.168.1.1:3644) was formed. Members
Mar 01 13:23:32 ha1 corosync[1552]: [CPG ] downlist left_list: 0
received
Mar 01 13:23:33 ha1 corosync[1552]: [VOTEQ ] waiting for quorum device
Qdevice poll (but maximum for 30000 ms)
Mar 01 13:23:33 ha1 corosync[1552]: [TOTEM ] A new membership
(192.168.1.1:3648) was formed. Members
Mar 01 13:23:33 ha1 corosync[1552]: [CPG ] downlist left_list: 0
received
Mar 01 13:23:35 ha1 corosync[1552]: [VOTEQ ] waiting for quorum device
Qdevice poll (but maximum for 30000 ms)
...
Mar 01 13:24:05 ha1 corosync-qdevice[1563]: Can't connect to qnetd host.
(-5986): Network address not available (in use?)
Mar 01 13:24:05 ha1 corosync-qdevice[1563]: Mar 01 13:24:05 error
Can't connect to qnetd host. (-5986): Network address not available (in
use?)
Mar 01 13:24:05 ha1 corosync[1552]: [VOTEQ ] waiting for quorum device
Qdevice poll (but maximum for 30000 ms)
Mar 01 13:24:05 ha1 corosync[1552]: [TOTEM ] A new membership
(192.168.1.1:3736) was formed. Members
Mar 01 13:24:05 ha1 corosync[1552]: [CPG ] downlist left_list: 0
received
Mar 01 13:24:05 ha1 corosync[1552]: [QUORUM] This node is within the
non-primary component and will NOT provide any services.
Mar 01 13:24:05 ha1 corosync[1552]: [QUORUM] Members[1]: 1
Mar 01 13:24:05 ha1 corosync[1552]: [MAIN ] Completed service
synchronization, ready to provide service.
Mar 01 13:24:05 ha1 pacemaker-controld[1740]: warning: Quorum lost
Mar 01 13:24:05 ha1 pacemaker-controld[1740]: notice: Node ha2 state is
now lost
Mar 01 13:24:05 ha1 pacemaker-controld[1740]: notice: State transition
S_ELECTION -> S_INTEGRATION
Mar 01 13:24:05 ha1 pacemaker-controld[1740]: notice: Updating quorum
status to false (call=56)
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: warning: Blind faith:
not fencing unseen nodes
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: warning: Fencing and
resource management disabled due to lack of quorum
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Stop
p_drbd0:0 ( Master ha1 ) due to no quorum
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Stop
p_drbd1:0 ( Slave ha1 ) due to no quorum
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Stop
p_fs_clust01 ( ha1 ) due to no quorum
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Start
p_fs_clust02 ( ha1 ) due to no quorum (blocked)
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Stop
p_mysql_001 ( ha1 ) due to no quorum
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Start
p_mysql_006 ( ha1 ) due to no quorum (blocked)
Mar 01 13:24:05 ha1 pacemaker-schedulerd[1739]: notice: * Start
p_mysql_666 ( ha1 ) due to no quorum (blocked)
Mar 01 13:24:05 ha1 pacemaker-controld[1740]: notice: Processing graph
0 (ref=pe_calc-dc-16145
So it took ha1 almost 40 seconds to reach a decision. Somehow this is
exactly the opposite of what I observed before, where the disconnected
node was fast and the connected node was slow.
While I can understand why the behavior changed for the connected node,
I still do not understand why the disconnected node now needs so much
time.
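My only guess about the ~40 seconds is the qdevice timeouts: the
30000 ms in the repeated "waiting for quorum device Qdevice poll"
messages should be quorum.device.sync_timeout. I do not set these
explicitly, so as far as I understand the effective values are the
documented defaults, roughly:

    quorum {
        device {
            model: net
            # defaults as far as I understand, not set in my config
            timeout: 10000        # ms, qdevice poll timeout in normal operation
            sync_timeout: 30000   # ms, qdevice poll timeout during corosync sync
        }
    }

which would be consistent with ha1 only settling at 13:24:05, about 38
seconds after the membership change started at 13:23:27.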
And what is worse, this is not reliable: the next time I test, both
nodes react almost immediately (the disconnected node needs just 3
seconds to decide it is out of quorum). That is the most irritating
part, as one expects consistent behavior here.
That is something inside corosync/corosync-qdevice. At least it does
seem to improve the situation with qnetd response timing towards the
surviving node.
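If anybody wants to compare, what I check on both sides during these
tests is just the standard tools:

    # on ha1 / ha2
    corosync-quorumtool -s
    corosync-qdevice-tool -s -v

    # on the qnetd host
    corosync-qnetd-tool -l -v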