[ClusterLabs] why is node fenced ?

Lentes, Bernd bernd.lentes at helmholtz-muenchen.de
Tue Aug 13 07:53:38 EDT 2019


----- On Aug 12, 2019, at 7:47 PM, Chris Walker cwalker at cray.com wrote:

> When ha-idg-1 started Pacemaker around 17:43, it did not see ha-idg-2, for
> example,
> 
> Aug 09 17:43:05 [6318] ha-idg-1 pacemakerd:     info: pcmk_quorum_notification:
> Quorum retained | membership=1320 members=1
> 
> after ~20s (dc-deadtime parameter), ha-idg-2 is marked 'unclean' and STONITHed
> as part of startup fencing.
> 
> There is nothing in ha-idg-2's HA logs around 17:43 indicating that it saw
> ha-idg-1 either, so it appears that there was no communication at all between
> the two nodes.
> 
> I'm not sure exactly why the nodes did not see one another, but there are
> indications of network issues around this time
> 
> 2019-08-09T17:42:16.427947+02:00 ha-idg-2 kernel: [ 1229.245533] bond1: now
> running without any active interface!
> 
> so perhaps that's related.

This is the initialization of bond1 on ha-idg-1 during boot.
Three seconds later bond1 is fine:

2019-08-09T17:42:19.299886+02:00 ha-idg-2 kernel: [ 1232.117470] tg3 0000:03:04.0 eth2: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.299908+02:00 ha-idg-2 kernel: [ 1232.117482] tg3 0000:03:04.0 eth2: Flow control is on for TX and on for RX
2019-08-09T17:42:19.315756+02:00 ha-idg-2 kernel: [ 1232.131565] tg3 0000:03:04.1 eth3: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.315767+02:00 ha-idg-2 kernel: [ 1232.131568] tg3 0000:03:04.1 eth3: Flow control is on for TX and on for RX
2019-08-09T17:42:19.351781+02:00 ha-idg-2 kernel: [ 1232.169386] bond1: link status definitely up for interface eth2, 1000 Mbps full duplex
2019-08-09T17:42:19.351792+02:00 ha-idg-2 kernel: [ 1232.169390] bond1: making interface eth2 the new active one
2019-08-09T17:42:19.352521+02:00 ha-idg-2 kernel: [ 1232.169473] bond1: first active interface up!
2019-08-09T17:42:19.352532+02:00 ha-idg-2 kernel: [ 1232.169480] bond1: link status definitely up for interface eth3, 1000 Mbps full duplex

Also on ha-idg-1:

2019-08-09T17:42:19.168035+02:00 ha-idg-1 kernel: [  110.164250] tg3 0000:02:00.3 eth3: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.168050+02:00 ha-idg-1 kernel: [  110.164252] tg3 0000:02:00.3 eth3: Flow control is on for TX and on for RX
2019-08-09T17:42:19.168052+02:00 ha-idg-1 kernel: [  110.164254] tg3 0000:02:00.3 eth3: EEE is disabled
2019-08-09T17:42:19.172020+02:00 ha-idg-1 kernel: [  110.171378] tg3 0000:02:00.2 eth2: Link is up at 1000 Mbps, full duplex
2019-08-09T17:42:19.172028+02:00 ha-idg-1 kernel: [  110.171380] tg3 0000:02:00.2 eth2: Flow control is on for TX and on for RX
2019-08-09T17:42:19.172029+02:00 ha-idg-1 kernel: [  110.171382] tg3 0000:02:00.2 eth2: EEE is disabled
 ...
2019-08-09T17:42:19.244066+02:00 ha-idg-1 kernel: [  110.240310] bond1: link status definitely up for interface eth2, 1000 Mbps full duplex
2019-08-09T17:42:19.244083+02:00 ha-idg-1 kernel: [  110.240311] bond1: making interface eth2 the new active one
2019-08-09T17:42:19.244085+02:00 ha-idg-1 kernel: [  110.240353] bond1: first active interface up!
2019-08-09T17:42:19.244087+02:00 ha-idg-1 kernel: [  110.240356] bond1: link status definitely up for interface eth3, 1000 Mbps full duplex

The cluster is started on ha-idg-1 afterwards, at 17:43:04. I don't find any further entries about problems with bond1, so I think it's not related.
Time is synchronized by NTP.
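
To make the timeline explicit, here is a minimal Python sketch that takes the timestamps from the kernel log lines quoted above and computes how long bond1 was without an active interface and how much time passed before the cluster start. The helper name parse_ts is made up for illustration, and the cluster start is assumed to be 17:43:04 on the same date with the same +02:00 offset as the kernel logs.

```python
from datetime import datetime


def parse_ts(line: str) -> datetime:
    """Return the leading ISO 8601 timestamp of a syslog line (hypothetical helper)."""
    return datetime.fromisoformat(line.split()[0])


# Log lines shortened from the excerpts in this mail; only the timestamp is used.
bond_down = parse_ts("2019-08-09T17:42:16.427947+02:00 ha-idg-2 kernel: bond1: now running without any active interface!")
bond_up_2 = parse_ts("2019-08-09T17:42:19.352521+02:00 ha-idg-2 kernel: bond1: first active interface up!")
bond_up_1 = parse_ts("2019-08-09T17:42:19.244085+02:00 ha-idg-1 kernel: bond1: first active interface up!")

# Assumption: Pacemaker start on ha-idg-1 at 17:43:04, same date and UTC offset.
cluster_start = datetime.fromisoformat("2019-08-09T17:43:04+02:00")

# Time bond1 on ha-idg-2 had no active interface.
outage = (bond_up_2 - bond_down).total_seconds()
# Time between the later "interface up" on either node and the cluster start.
margin = (cluster_start - max(bond_up_1, bond_up_2)).total_seconds()

print(f"bond1 without active interface for {outage:.1f} s")
print(f"cluster started {margin:.1f} s after bond1 was up on both nodes")
```

With these values bond1 is down for roughly 3 seconds and is back on both nodes well before the cluster start, which matches the reading of the logs above. It only compares timestamps and says nothing about whether traffic actually passed over the bond.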


Bernd
 



