[ClusterLabs] odd cluster failure

Fri Feb 3 14:59:22 EST 2017

(Apologies if this is a duplicate. I accidentally posted to the old
linux-ha.org address, and I couldn't tell from the auto-reply whether my
message was actually posted to the list.)

For the second time in a few weeks, one node of a particular cluster has
been fenced, and it isn't clear why. On the surviving node (vmc1) I see:

Feb  2 16:48:52 vmc1 stonith-ng[4331]:   notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
Feb  2 16:48:52 vmc1 stonith-ng[4331]:   notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
Feb  2 16:49:00 vmc1 kernel: igb 0000:03:00.1 eth3: igb: eth3 NIC Link is Down
Feb  2 16:49:00 vmc1 kernel: xenbr0: port 1(eth3) entered disabled state
Feb  2 16:49:01 vmc1 corosync[2846]:   [TOTEM ] A processor failed, forming new configuration.

OK, so from this point of view, it looks like the link between the two
hosts was lost, resulting in fencing. The link is a crossover cable, so
there is no networking hardware involved other than the host NICs and the
cable.

On the other side I see:

Feb  2 16:46:46 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb  2 16:46:46 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb  2 16:46:47 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb  2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb  2 16:46:48 vmc2 kernel: device vif17.0 left promiscuous mode
Feb  2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb  2 16:46:48 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb  2 16:46:49 vmc2 crmd[4191]:   notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sending flush op to all hosts for: fail-count-VM-radnets (1)
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sent update 37: fail-count-VM-radnets=1
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sending flush op to all hosts for: last-failure-VM-radnets (1486079209)
Feb  2 16:46:49 vmc2 attrd[4189]:   notice: Sent update 39: last-failure-VM-radnets=1486079209
Feb  2 16:46:50 vmc2 pengine[4190]:   notice: On loss of CCM Quorum: Ignore
Feb  2 16:46:50 vmc2 pengine[4190]:  warning: Processing failed op monitor for VM-radnets on vmc2.ucar.edu: not running (7)
Feb  2 16:46:50 vmc2 pengine[4190]:   notice: Recover VM-radnets#011(Started vmc2.ucar.edu)
Feb  2 16:46:50 vmc2 pengine[4190]:   notice: Calculated Transition 2914: /var/lib/pacemaker/pengine/pe-input-317.bz2
Feb  2 16:46:50 vmc2 crmd[4191]:   notice: Initiating action 15: stop VM-radnets_stop_0 on vmc2.ucar.edu (local)
Feb  2 16:46:51 vmc2 Xen(VM-radnets)[1016]: INFO: Xen domain radnets will be stopped (timeout: 80s)
Feb  2 16:46:52 vmc2 kernel: device vif21.0 entered promiscuous mode
Feb  2 16:46:52 vmc2 kernel: IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is not ready
Feb  2 16:46:57 vmc2 kernel: xen-blkback:ring-ref 9, event-channel 10, protocol 1 (x86_64-abi)
Feb  2 16:46:57 vmc2 kernel: vif vif-21-0 vif21.0: Guest Rx ready
Feb  2 16:46:57 vmc2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link becomes ready
Feb  2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
Feb  2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
Feb  2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state

 (and then there are a bunch of null bytes, and the log resumes with the
 reboot)

There are more messages about networking here, but xenbr1 is not the bridge
device associated with the NIC in question.
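
To line up the two excerpts, here is a rough Python sketch that compares the
last vmc2 entry before the null bytes with the link-down event on vmc1, and
converts the last-failure-VM-radnets epoch value to wall-clock time. It
assumes the two hosts' clocks are in sync, that the year is 2017 (syslog does
not record it), and that the hosts log in UTC-7; that offset is a guess,
picked only because it would make the epoch agree with the syslog stamp on the
same attrd line. The helper name parse_syslog_ts is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

def parse_syslog_ts(line, year=2017):
    """Parse the leading 'Mon D HH:MM:SS' stamp of a syslog line.

    Syslog omits the year, so the caller supplies it (assumed to be 2017).
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")

# Last vmc2 entry before the null bytes, and the link-down event seen on vmc1.
last_vmc2 = parse_syslog_ts(
    "Feb  2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state")
link_down = parse_syslog_ts(
    "Feb  2 16:49:00 vmc1 kernel: igb 0000:03:00.1 eth3: igb: eth3 NIC Link is Down")

# Only meaningful if both hosts' clocks are synchronized (an assumption).
gap_seconds = (link_down - last_vmc2).total_seconds()
print(gap_seconds)

# last-failure-VM-radnets is seconds since the Unix epoch.
# UTC-7 is an assumed local offset, not something the logs state.
ASSUMED_TZ = timezone(timedelta(hours=-7))
last_failure = datetime.fromtimestamp(1486079209, tz=ASSUMED_TZ)
last_failure_text = last_failure.strftime("%b %d %H:%M:%S")
print(last_failure_text)
```

If the clocks really are in sync, the gap printed here is the time between
vmc2's last logged message and vmc1 noticing the link go down.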

I don't see any reason why the link between the hosts should suddenly stop
working, so I suspect a hardware problem that crops up only rarely (but
will most likely get worse over time).
Is there anything in the logs that would suggest otherwise?

Thank you,
--Greg