[ClusterLabs] why is node fenced ?

Jan Pokorný jpokorny at redhat.com
Fri May 17 09:09:26 EDT 2019


On 16/05/19 17:10 +0200, Lentes, Bernd wrote:
> my two-node HA cluster fenced one node on the 14th of May.
> ha-idg-1 was the DC; ha-idg-2 was fenced.
> It happened around 11:30 am.
> The log from the fenced node isn't really informative:
> 
> [...]
> 
> The node restarted at 11:44 am.
> The DC's log is more informative:
> 
> =================================
> 2019-05-14T11:24:05.105739+02:00 ha-idg-1 PackageKit: daemon quit
> 2019-05-14T11:24:05.106284+02:00 ha-idg-1 packagekitd[11617]: (packagekitd:11617): GLib-CRITICAL **: Source ID 15 was not found when attempting to remove it
> 2019-05-14T11:27:23.276813+02:00 ha-idg-1 liblogging-stdlog: -- MARK --
> 2019-05-14T11:30:01.248803+02:00 ha-idg-1 cron[24140]: pam_unix(crond:session): session opened for user root by (uid=0)
> 2019-05-14T11:30:01.253150+02:00 ha-idg-1 systemd[1]: Started Session 17988 of user root.
> 2019-05-14T11:30:01.301674+02:00 ha-idg-1 CRON[24140]: pam_unix(crond:session): session closed for user root
> 2019-05-14T11:30:03.710784+02:00 ha-idg-1 kernel: [1015426.947016] tg3 0000:02:00.3 eth3: Link is down
> 2019-05-14T11:30:03.792500+02:00 ha-idg-1 kernel: [1015427.024779] bond1: link status definitely down for interface eth3, disabling it
> 2019-05-14T11:30:04.849892+02:00 ha-idg-1 hp-ams[2559]: CRITICAL: Network Adapter Link Down (Slot 0, Port 4)
> 2019-05-14T11:30:05.261968+02:00 ha-idg-1 kernel: [1015428.498127] tg3 0000:02:00.3 eth3: Link is up at 100 Mbps, full duplex
> 2019-05-14T11:30:05.261985+02:00 ha-idg-1 kernel: [1015428.498138] tg3 0000:02:00.3 eth3: Flow control is on for TX and on for RX
> 2019-05-14T11:30:05.261986+02:00 ha-idg-1 kernel: [1015428.498143] tg3 0000:02:00.3 eth3: EEE is disabled
> 2019-05-14T11:30:05.352500+02:00 ha-idg-1 kernel: [1015428.584725] bond1: link status definitely up for interface eth3, 100 Mbps full duplex
> 2019-05-14T11:30:05.983387+02:00 ha-idg-1 hp-ams[2559]: NOTICE: Network Adapter Link Down (Slot 0, Port 4) has been repaired
> 2019-05-14T11:30:10.520149+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] A processor failed, forming new configuration.
> 2019-05-14T11:30:16.524341+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] A new membership (192.168.100.10:1120) was formed. Members left: 1084777492
> 2019-05-14T11:30:16.524799+02:00 ha-idg-1 corosync[6957]:   [TOTEM ] Failed to receive the leave message. failed: 1084777492
> 2019-05-14T11:30:16.525199+02:00 ha-idg-1 lvm[12430]: confchg callback. 0 joined, 1 left, 1 members
> 2019-05-14T11:30:16.525706+02:00 ha-idg-1 attrd[6967]:   notice: Node ha-idg-2 state is now lost
> 2019-05-14T11:30:16.526143+02:00 ha-idg-1 cib[6964]:   notice: Node ha-idg-2 state is now lost
> 2019-05-14T11:30:16.526480+02:00 ha-idg-1 attrd[6967]:   notice: Removing all ha-idg-2 attributes for peer loss
> 2019-05-14T11:30:16.526742+02:00 ha-idg-1 cib[6964]:   notice: Purged 1 peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.527283+02:00 ha-idg-1 stonith-ng[6965]:   notice: Node ha-idg-2 state is now lost
> 2019-05-14T11:30:16.527618+02:00 ha-idg-1 attrd[6967]:   notice: Purged 1 peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.527884+02:00 ha-idg-1 stonith-ng[6965]:   notice: Purged 1 peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.528156+02:00 ha-idg-1 corosync[6957]:   [QUORUM] Members[1]: 1084777482
> 2019-05-14T11:30:16.528435+02:00 ha-idg-1 corosync[6957]:   [MAIN  ] Completed service synchronization, ready to provide service.
> 2019-05-14T11:30:16.548481+02:00 ha-idg-1 kernel: [1015439.782587] dlm: closing connection to node 1084777492
> 2019-05-14T11:30:16.555995+02:00 ha-idg-1 dlm_controld[12279]: 1015492 fence request 1084777492 pid 24568 nodedown time 1557826216 fence_all dlm_stonith
> 2019-05-14T11:30:16.626285+02:00 ha-idg-1 crmd[6969]:  warning: Stonith/shutdown of node ha-idg-2 was not expected
> 2019-05-14T11:30:16.626534+02:00 ha-idg-1 dlm_stonith: stonith_api_time: Found 1 entries for 1084777492/(null): 0 in progress, 1 completed
> 2019-05-14T11:30:16.626731+02:00 ha-idg-1 dlm_stonith: stonith_api_time: Node 1084777492/(null) last kicked at: 1556884018
> 2019-05-14T11:30:16.626875+02:00 ha-idg-1 stonith-ng[6965]:   notice: Client stonith-api.24568.6a9fa406 wants to fence (reboot) '1084777492' with device '(any)'
> 2019-05-14T11:30:16.627026+02:00 ha-idg-1 crmd[6969]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> 2019-05-14T11:30:16.627165+02:00 ha-idg-1 crmd[6969]:   notice: Node ha-idg-2 state is now lost
> 2019-05-14T11:30:16.627302+02:00 ha-idg-1 crmd[6969]:  warning: Stonith/shutdown of node ha-idg-2 was not expected
> 2019-05-14T11:30:16.627439+02:00 ha-idg-1 stonith-ng[6965]:   notice: Requesting peer fencing (reboot) of ha-idg-2
> 2019-05-14T11:30:16.627578+02:00 ha-idg-1 pacemakerd[6963]:   notice: Node ha-idg-2 state is now lost
> ==================================
> 
> One network interface went down for a short period. But it's part
> of a bonding device (round-robin), so the connection shouldn't
> have been lost.

Well, not all bonding modes are equal, and without any further
knowledge, my guess is that round-robin provides no redundancy
message-wise, so the token got irrecoverably lost anyway
(remember, corosync messaging has little in common with the TCP
you may have in mind when reasoning about this), the token
timeout kicked in, and the configured safety measure (fencing)
ensued.
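
For what it's worth, if the bond is meant to give the cluster links
redundancy, a failover mode such as active-backup is usually a safer
choice than balance-rr: round-robin stripes individual frames across
the slaves, so a flapping port silently eats whatever frames (e.g.
the totem token) happened to be scheduled on it.  A minimal sketch,
assuming SUSE-style ifcfg files and your eth2/eth3 as the slaves
(hypothetical, verify against your distribution's docs):

    # /etc/sysconfig/network/ifcfg-bond1  (sketch only)
    BONDING_MASTER='yes'
    # active-backup: one slave carries all traffic; on link failure
    # the other takes over, so a single bad port drops no frames
    BONDING_MODULE_OPTS='mode=active-backup miimon=100'
    BONDING_SLAVE0='eth2'
    BONDING_SLAVE1='eth3'
    STARTMODE='auto'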

That looks like correct behaviour to me, especially since dlm was
involved (note that dlm_controld itself requested the fence in the
log above; DLM cannot safely proceed until a lost node is fenced).

> Both nodes are connected directly, there is no switch in between.
> I manually stopped the interface (ifconfig eth3 down) several
> times afterwards ... nothing happened.

Forget about if-downing any interfaces, please, at least with
corosync deployments without kronosnet.  It might be a good idea
for the corosync folks to document that warning more prominently,
though this list is already full of it.
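
That said, if the intent was to test the cluster's reaction to
a lost link, blocking the traffic with a firewall rule is a much
closer approximation than downing the interface, since corosync
copes badly with its bound address disappearing.  A rough sketch,
assuming the default totem ports around mcastport 5405 (adjust to
your corosync.conf):

    # on the node to be "disconnected" (test sketch only)
    iptables -A INPUT  -p udp --dport 5404:5406 -j DROP
    iptables -A OUTPUT -p udp --dport 5404:5406 -j DROP

    # ...observe the token loss and the expected fencing...

    # undo (if the node survives that long)
    iptables -D INPUT  -p udp --dport 5404:5406 -j DROP
    iptables -D OUTPUT -p udp --dport 5404:5406 -j DROP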

> The same with the second interface (eth2).
> ???

-- 
Jan (Poki)