[ClusterLabs] Howto stonith in the case of any interface failure?
lists at alteeve.ca
Wed Oct 9 11:33:42 EDT 2019
On 2019-10-09 3:58 a.m., Kadlecsik József wrote:
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
> One of the nodes had a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), with the result that the node could still process traffic
> on the backend interface but not on the frontend one. Thus the services
> became unavailable, but the cluster considered the node healthy and did
> not stonith it.
> How could we protect the cluster against such failures?
> We could configure a second corosync ring, but that would be a redundancy
> ring only.
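
For reference, the redundant ring mentioned above would be declared per node in corosync 3 (knet); a minimal sketch, with hypothetical interface addresses. As the original poster notes, this only adds redundancy for corosync traffic and by itself would not trigger fencing when just the frontend fails:

```
# /etc/corosync/corosync.conf (fragment) -- addresses are illustrative
nodelist {
    node {
        name: node1
        nodeid: 1
        ring0_addr: 10.0.0.1    # backend network
        ring1_addr: 192.0.2.1   # frontend network
    }
    node {
        name: node2
        nodeid: 2
        ring0_addr: 10.0.0.2
        ring1_addr: 192.0.2.2
    }
}
```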
> We could set up a second, independent corosync configuration for a second
> pacemaker instance running just the stonith agents. Is it enough to specify
> the cluster name in the corosync config to pair pacemaker with corosync?
> And how can we tell pacemaker to connect to this second corosync instance?
> Which is the best way to solve the problem?
> Best regards,
We use mode=1 (active-passive) bonded network interfaces for each
network connection (we also have back-end, front-end and storage
networks). Each bond has one link going to one switch and the other link
to a second switch. For fence devices, we use IPMI fencing connected via
switch 1 and PDU fencing as the backup method connected via switch 2.
With this setup, no matter which single component fails, one of the
fence methods remains available. It's saved us in the field a few times now.
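
A minimal sketch of that layout, where all interface names, addresses and credentials are hypothetical (not from the original post): an active-backup bond spanning the two switches, and a two-level fencing topology in pcs syntax so IPMI is tried first and the PDU is the fallback:

```
# Active-passive bond across two switches (iproute2; names illustrative)
ip link add bond0 type bond mode active-backup miimon 100
ip link set eth0 master bond0   # link to switch 1
ip link set eth1 master bond0   # link to switch 2
ip link set bond0 up

# Fencing: IPMI reachable via switch 1, switched PDU via switch 2
# (agent parameters and addresses are illustrative)
pcs stonith create fence_node1_ipmi fence_ipmilan \
    ip=10.0.1.1 username=admin password=secret pcmk_host_list=node1
pcs stonith create fence_node1_pdu fence_apc_snmp \
    ip=10.0.2.1 port=3 pcmk_host_list=node1
pcs stonith level add 1 node1 fence_node1_ipmi   # try IPMI first
pcs stonith level add 2 node1 fence_node1_pdu    # fall back to the PDU
```

With fencing levels configured this way, pacemaker only moves to level 2 if every device at level 1 fails, which is what makes the dual-switch wiring pay off.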
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould