[ClusterLabs] Howto stonith in the case of any interface failure?

Wed Oct 9 05:52:21 EDT 2019

On Wed, Oct 9, 2019 at 10:59 AM Kadlecsik József
<kadlecsik.jozsef at wigner.mta.hu> wrote:
>
> Hello,
>
> The nodes in our cluster have got backend and frontend interfaces: the
> former ones are for the storage and cluster (corosync) traffic and the
> latter ones are for the public services of KVM guests only.
>
> One of the nodes suffered a failure ("watchdog: BUG: soft lockup - CPU#7
> stuck for 23s"), after which the node could process traffic on the
> backend interface but not on the frontend one. Thus the services became
> unavailable, but the cluster thought the node was all right and did not
> stonith it.
>
> How could we protect the cluster against such failures?
>
> We could configure a second corosync ring, but that would be a redundancy
> ring only.
>
> We could set up a second, independent corosync configuration for a second
> pacemaker just with stonith agents. Is it enough to specify the cluster
> name in the corosync config to pair pacemaker to corosync? What about the
> pairing of pacemaker to this corosync instance, how can we tell pacemaker
> to connect to this corosync instance?
>
> Which is the best way to solve the problem?
>

That really depends on what "node could process traffic" means. If it
is just about basic IP connectivity, you can use the ocf:pacemaker:ping
resource to monitor network availability and move resources away when
the current node is considered "unconnected". This is documented in
Pacemaker Explained, section 8.3.2, "Moving Resources Due to
Connectivity Changes".
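
A minimal sketch of such a configuration, in CIB XML along the lines of
that section. The resource name kvm-guest, all ids, and the addresses in
host_list are placeholders (assumptions, not taken from your setup). For
your case host_list would have to contain addresses that are reachable
only through the frontend interface:

```xml
<!-- Run the ping agent on every node (cloned resource). -->
<clone id="ping-clone">
  <primitive id="ping" class="ocf" provider="pacemaker" type="ping">
    <instance_attributes id="ping-attrs">
      <!-- Placeholder addresses: use hosts on the frontend network. -->
      <nvpair id="ping-hosts" name="host_list" value="192.0.2.1 192.0.2.2"/>
      <nvpair id="ping-multiplier" name="multiplier" value="1000"/>
      <nvpair id="ping-dampen" name="dampen" value="5s"/>
    </instance_attributes>
    <operations>
      <op id="ping-monitor-10s" name="monitor" interval="10s"/>
    </operations>
  </primitive>
</clone>

<!-- Ban the (hypothetical) kvm-guest resource from unconnected nodes. -->
<rsc_location id="kvm-guest-needs-frontend" rsc="kvm-guest">
  <rule id="ping-exclude-rule" score="-INFINITY" boolean-op="or">
    <expression id="ping-undefined" attribute="pingd"
                operation="not_defined"/>
    <expression id="ping-zero" attribute="pingd" operation="lte"
                value="0" type="integer"/>
  </rule>
</rsc_location>
```

The agent stores the number of reachable hosts times the multiplier in
the node attribute pingd (its default name), and the rule keeps the
resource off any node where that attribute is undefined or zero. Note
that this moves resources away from the node; it does not fence it.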

If "process traffic" means something else, you need a custom agent that
implements whatever checks are necessary to decide that the node can no
longer process traffic.