[ClusterLabs] Howto stonith in the case of any interface failure?

Kadlecsik József kadlecsik.jozsef at wigner.mta.hu
Wed Oct 9 14:16:25 EDT 2019


On Wed, 9 Oct 2019, Digimer wrote:

> > One of the nodes suffered a failure ("watchdog: BUG: soft lockup - 
> > CPU#7 stuck for 23s"), as a result of which the node could process 
> > traffic on the backend interface but not on the frontend one. Thus the 
> > services became unavailable, but the cluster thought the node was all 
> > right and did not stonith it.
> > 
> > How could we protect the cluster against such failures?
> > 
> We use mode=1 (active-passive) bonded network interfaces for each 
> network connection (we also have a back-end, front-end and a storage 
> network). Each bond has a link going to one switch and the other link to 
> a second switch. For fence devices, we use IPMI fencing connected via 
> switch 1 and PDU fencing as the backup method connected on switch 2.
> 
> With this setup, no matter what might fail, one of the fence methods
> will still be available. It's saved us in the field a few times now.

A bonded interface helps, but I suspect that in this case it could not 
have saved the situation. It was not an interface failure but a strange kind of 
system lockup: some of the already running processes (corosync, for 
example) were fine, but sshd could not accept new connections, not even 
from the direction of the seemingly fine backend interface.
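
To illustrate the symptom, here is a minimal sketch of a check that treats a 
service as alive only if a brand-new TCP connection actually gets an answer. 
The function name and the timeout are hypothetical, and it assumes a service 
such as sshd that sends a banner right after the connection is accepted. How 
the result would be fed back into the cluster is left open here.

```python
import socket


def accepts_new_connections(host, port, timeout=3.0):
    """Return True only if a fresh TCP connection to host:port
    delivers at least one byte within `timeout` seconds.

    A completed TCP handshake alone is not enough: the kernel can
    finish the handshake from the listen backlog while the daemon
    itself never gets around to accepting the connection.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # sshd sends its "SSH-2.0-..." banner first, so any data
            # at all means a process really handled the connection.
            return len(conn.recv(64)) > 0
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


# Example call (192.0.2.10 is a documentation address, not a real node):
#   accepts_new_connections("192.0.2.10", 22)
```

A check of this kind looks at the service from the outside, so it can fail 
even while long-running processes like corosync still appear healthy.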

In the backend direction we have got bonded (LACP) interfaces - the 
frontend uses single interfaces only.

Best regards,
Jozsef
--
E-mail : kadlecsik.jozsef at wigner.mta.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: Wigner Research Centre for Physics
         H-1525 Budapest 114, POB. 49, Hungary

