[ClusterLabs] Antw: why is node fenced ?
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Mon May 20 02:28:22 EDT 2019
>>> "Lentes, Bernd" <bernd.lentes at helmholtz-muenchen.de> schrieb am 16.05.2019
um
17:10 in Nachricht
<1151882511.6631123.1558019430655.JavaMail.zimbra at helmholtz-muenchen.de>:
> Hi,
>
> my two-node HA cluster fenced one of its nodes on May 14th.
> ha-idg-1 was the DC, ha-idg-2 was fenced.
> It happened around 11:30 am.
> The log from the fenced node isn't really informative:
>
> ==================================
> 2019-05-14T11:22:09.948980+02:00 ha-idg-2 liblogging-stdlog: -- MARK --
> 2019-05-14T11:28:21.548898+02:00 ha-idg-2 sshd[14269]: Accepted
> keyboard-interactive/pam for root from 10.35.34.70 port 59449 ssh2
> 2019-05-14T11:28:21.550602+02:00 ha-idg-2 sshd[14269]:
> pam_unix(sshd:session): session opened for user root by (uid=0)
> 2019-05-14T11:28:21.554640+02:00 ha-idg-2 systemd-logind[2798]: New session
> 15385 of user root.
> 2019-05-14T11:28:21.555067+02:00 ha-idg-2 systemd[1]: Started Session 15385
> of user root.
>
> Restart !!!
> 2019-05-14T11:44:07.664785+02:00 ha-idg-2 systemd[1]: systemd 228 running in
> system mode. (+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP
> +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD -IDN)
> 2019-05-14T11:44:07.664902+02:00 ha-idg-2 kernel: [ 0.000000] Linux
> version 4.12.14-95.13-default (geeko at buildhost) (gcc version 4.8.5 (SUSE
> Linux) ) #1 SMP Fri Mar
> 22 06:04:58 UTC 2019 (c01bf34)
> 2019-05-14T11:44:07.665492+02:00 ha-idg-2 systemd[1]: Detected architecture
> x86-64.
> 2019-05-14T11:44:07.665510+02:00 ha-idg-2 kernel: [ 0.000000] Command
> line: BOOT_IMAGE=/boot/vmlinuz-4.12.14-95.13-default
> root=/dev/mapper/vg_local-lv_root
> resume=/dev/disk/by-uuid/2849c504-2e45-4ec8-bbf8-724cf358ee25 splash=verbose
> showopts
> 2019-05-14T11:44:07.665510+02:00 ha-idg-2 systemd[1]: Set hostname to
> <ha-idg-2>.
> =================================
>
> The node restarted at 11:44 am.
> The DC's log is more informative:
>
> =================================
> 2019-05-14T11:24:05.105739+02:00 ha-idg-1 PackageKit: daemon quit
> 2019-05-14T11:24:05.106284+02:00 ha-idg-1 packagekitd[11617]:
> (packagekitd:11617): GLib-CRITICAL **: Source ID 15 was not found when
> attempting to remove it
> 2019-05-14T11:27:23.276813+02:00 ha-idg-1 liblogging-stdlog: -- MARK --
> 2019-05-14T11:30:01.248803+02:00 ha-idg-1 cron[24140]:
> pam_unix(crond:session): session opened for user root by (uid=0)
> 2019-05-14T11:30:01.253150+02:00 ha-idg-1 systemd[1]: Started Session 17988
> of user root.
> 2019-05-14T11:30:01.301674+02:00 ha-idg-1 CRON[24140]:
> pam_unix(crond:session): session closed for user root
> 2019-05-14T11:30:03.710784+02:00 ha-idg-1 kernel: [1015426.947016] tg3
> 0000:02:00.3 eth3: Link is down
> 2019-05-14T11:30:03.792500+02:00 ha-idg-1 kernel: [1015427.024779] bond1:
> link status definitely down for interface eth3, disabling it
> 2019-05-14T11:30:04.849892+02:00 ha-idg-1 hp-ams[2559]: CRITICAL: Network
> Adapter Link Down (Slot 0, Port 4)
> 2019-05-14T11:30:05.261968+02:00 ha-idg-1 kernel: [1015428.498127] tg3
> 0000:02:00.3 eth3: Link is up at 100 Mbps, full duplex
> 2019-05-14T11:30:05.261985+02:00 ha-idg-1 kernel: [1015428.498138] tg3
> 0000:02:00.3 eth3: Flow control is on for TX and on for RX
> 2019-05-14T11:30:05.261986+02:00 ha-idg-1 kernel: [1015428.498143] tg3
> 0000:02:00.3 eth3: EEE is disabled
> 2019-05-14T11:30:05.352500+02:00 ha-idg-1 kernel: [1015428.584725] bond1:
> link status definitely up for interface eth3, 100 Mbps full duplex
> 2019-05-14T11:30:05.983387+02:00 ha-idg-1 hp-ams[2559]: NOTICE: Network
> Adapter Link Down (Slot 0, Port 4) has been repaired
> 2019-05-14T11:30:10.520149+02:00 ha-idg-1 corosync[6957]: [TOTEM ] A
> processor failed, forming new configuration.
> 2019-05-14T11:30:16.524341+02:00 ha-idg-1 corosync[6957]: [TOTEM ] A new
> membership (192.168.100.10:1120) was formed. Members left: 1084777492
> 2019-05-14T11:30:16.524799+02:00 ha-idg-1 corosync[6957]: [TOTEM ] Failed
> to receive the leave message. failed: 1084777492
> 2019-05-14T11:30:16.525199+02:00 ha-idg-1 lvm[12430]: confchg callback. 0
> joined, 1 left, 1 members
> 2019-05-14T11:30:16.525706+02:00 ha-idg-1 attrd[6967]: notice: Node
> ha-idg-2 state is now lost
> 2019-05-14T11:30:16.526143+02:00 ha-idg-1 cib[6964]: notice: Node ha-idg-2
> state is now lost
> 2019-05-14T11:30:16.526480+02:00 ha-idg-1 attrd[6967]: notice: Removing
> all ha-idg-2 attributes for peer loss
> 2019-05-14T11:30:16.526742+02:00 ha-idg-1 cib[6964]: notice: Purged 1 peer
> with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.527283+02:00 ha-idg-1 stonith-ng[6965]: notice: Node
> ha-idg-2 state is now lost
> 2019-05-14T11:30:16.527618+02:00 ha-idg-1 attrd[6967]: notice: Purged 1
> peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.527884+02:00 ha-idg-1 stonith-ng[6965]: notice: Purged
> 1 peer with id=1084777492 and/or uname=ha-idg-2 from the membership cache
> 2019-05-14T11:30:16.528156+02:00 ha-idg-1 corosync[6957]: [QUORUM]
> Members[1]: 1084777482
> 2019-05-14T11:30:16.528435+02:00 ha-idg-1 corosync[6957]: [MAIN ]
> Completed service synchronization, ready to provide service.
> 2019-05-14T11:30:16.548481+02:00 ha-idg-1 kernel: [1015439.782587] dlm:
> closing connection to node 1084777492
> 2019-05-14T11:30:16.555995+02:00 ha-idg-1 dlm_controld[12279]: 1015492 fence
> request 1084777492 pid 24568 nodedown time 1557826216 fence_all dlm_stonith
> 2019-05-14T11:30:16.626285+02:00 ha-idg-1 crmd[6969]: warning:
> Stonith/shutdown of node ha-idg-2 was not expected
> 2019-05-14T11:30:16.626534+02:00 ha-idg-1 dlm_stonith: stonith_api_time:
> Found 1 entries for 1084777492/(null): 0 in progress, 1 completed
> 2019-05-14T11:30:16.626731+02:00 ha-idg-1 dlm_stonith: stonith_api_time:
> Node 1084777492/(null) last kicked at: 1556884018
> 2019-05-14T11:30:16.626875+02:00 ha-idg-1 stonith-ng[6965]: notice: Client
> stonith-api.24568.6a9fa406 wants to fence (reboot) '1084777492' with device
> '(any)'
> 2019-05-14T11:30:16.627026+02:00 ha-idg-1 crmd[6969]: notice: State
> transition S_IDLE -> S_POLICY_ENGINE
> 2019-05-14T11:30:16.627165+02:00 ha-idg-1 crmd[6969]: notice: Node
> ha-idg-2 state is now lost
> 2019-05-14T11:30:16.627302+02:00 ha-idg-1 crmd[6969]: warning:
> Stonith/shutdown of node ha-idg-2 was not expected
> 2019-05-14T11:30:16.627439+02:00 ha-idg-1 stonith-ng[6965]: notice:
> Requesting peer fencing (reboot) of ha-idg-2
> 2019-05-14T11:30:16.627578+02:00 ha-idg-1 pacemakerd[6963]: notice: Node
> ha-idg-2 state is now lost
> ==================================
>
> One network interface went down for a short period. But it is part of a bonding
> device (round-robin), so the connection shouldn't have been lost. Both nodes are
> connected directly; there is no switch in between.
I think you misunderstood: a round-robin bonding device is not fault-tolerant
IMHO, but it depends a lot on your cabling details. Also, you did not show the
logs of the other node around the time of the failure.
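If a single link failure is supposed to be transparent, an active-backup bond with
MII link monitoring is the usual approach. Just as a rough, untested sketch with
iproute2 (the bond name is made up, eth2/eth3 taken from your mail, and obviously
not something to run on the live cluster interconnect):

    # create an active-backup bond that polls the slave link state every 100 ms
    ip link add bond_test type bond mode active-backup miimon 100
    ip link set eth2 down; ip link set eth2 master bond_test
    ip link set eth3 down; ip link set eth3 master bond_test
    ip link set bond_test up

    # show mode, miimon and per-slave MII status as the kernel sees them
    cat /proc/net/bonding/bond_test

Whether the short link flap can explain the lost membership also depends on whether
your bond1 has miimon or ARP monitoring configured at all; /proc/net/bonding/bond1
will tell you.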
> Afterwards I manually stopped the interface (ifconfig eth3 down) several times
> ... nothing happened.
> The same with the second interface (eth2).
> ???
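Note that "ifconfig eth3 down" is an administrative down and may not behave exactly
like the carrier loss the kernel logged. If you repeat the test, it may help to
watch the membership from both sides while the link is down, e.g.:

    corosync-cfgtool -s   # ring status as corosync sees it
    crm_mon -1            # one-shot view of membership and resources

and to compare the corosync token timeout in corosync.conf with how long the link
was actually gone.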
>
>
> Bernd
>
> --
>
> Bernd Lentes
> System Administration
> Institut für Entwicklungsgenetik
> Building 35.34 - Room 208
> HelmholtzZentrum münchen
> bernd.lentes at helmholtz-muenchen.de
> phone: +49 89 3187 1241
> phone: +49 89 3187 3827
> fax: +49 89 3187 2294
> http://www.helmholtz-muenchen.de/idg
>
> he who makes mistakes can learn something
> he who does nothing cannot learn anything either