[ClusterLabs] Howto stonith in the case of any interface failure?

Wed Oct 9 04:50:42 EDT 2019

On 09/10/19 09:58 +0200, Kadlecsik József wrote:
> The nodes in our cluster have got backend and frontend interfaces: the 
> former ones are for the storage and cluster (corosync) traffic and the 
> latter ones are for the public services of KVM guests only.
> 
> One of the nodes has got a failure ("watchdog: BUG: soft lockup - CPU#7 
> stuck for 23s"), which resulted that the node could process traffic on the 
> backend interface but not on the fronted one. Thus the services became 
> unavailable but the cluster thought the node is all right and did not 
> stonith it. 
> 
> How could we protect the cluster against such failures?
> 
> We could configure a second corosync ring, but that would be a redundancy 
> ring only.
> 
> We could setup a second, independent corosync configuration for a second 
> pacemaker just with stonith agents. Is it enough to specify the cluster 
> name in the corosync config to pair pacemaker to corosync? What about the 
> pairing of pacemaker to this corosync instance, how can we tell pacemaker 
> to connect to this corosync instance?

Such pairing happens on the Unix socket system-wide singleton basis.
IOW, two instances of the corosync on the same machine would
apparently conflict -- only a single daemon can run at a time.

> Which is the best way to solve the problem? 

Looks like heuristics of corosync-qdevice that would ping/attest your
frontend interface could be a way to go.  You'd need an additional
host in your setup, though.

-- 
Poki
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 819 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20191009/2ef7dbf4/attachment.sig>