[ClusterLabs] Problem with the cluster becoming mostly unresponsive

Fri May 14 18:06:44 EDT 2021

On Fri, 2021-05-14 at 15:04 -0400, Digimer wrote:
> Hi all,
> 
>   I've run into an issue a couple of times now, and I'm not really sure
> what's causing it. I've got a RHEL 8 cluster that, after a while, will
> show one or more resources as 'FAILED'. When I try to do a cleanup, it
> marks the resources as stopped, despite them still running. After that,
> all attempts to manage the resources cause no change. The pcs command
> seems to have no effect, and in some cases refuses to return.
> 
> The logs from the nodes (filtered for 'pcs' and 'pacem' since boot) are
> here (resources running on node 2):
> 
> - https://www.alteeve.com/files/an-a02n01.pacemaker_hang.2021-05-14.txt

The SNMP fence agent fails to start:

May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]:  warning: fence_apc_snmp[12842] stderr: [  ]
May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]:  warning: fence_apc_snmp[12842] stderr: [  ]
May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]:  warning: fence_apc_snmp[12842] stderr: [ 2021-05-12 23:29:25,955 ERROR: Please use '-h' for usage ]
May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]:  warning: fence_apc_snmp[12842] stderr: [  ]
May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]:  notice: Operation 'monitor' [12842] for device 'apc_snmp_node2_an-pdu02' returned: -201 (Generic Pacemaker error)
May 12 23:29:25 an-a02n01.alteeve.com pacemaker-controld[5951]:  notice: Result of start operation for apc_snmp_node2_an-pdu02 on an-a02n01: error

which is fatal (because start-failure-is-fatal=true):

May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: Setting fail-count-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> INFINITY
May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: Setting last-failure-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> 1620876566

The same thing happens for both devices on both nodes, so they get
stopped (successfully), which effectively disables them. I don't see
fencing being needed anywhere in these logs, though, so that wouldn't
have mattered up to this point.
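
To list every such start failure from the filtered logs, something like
the following could be used. It is only a rough sketch: the regular
expression is inferred from the two pacemaker-attrd lines quoted above,
not from any documented log format, and the function names are made up
for illustration.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the attrd lines quoted above (an assumption, not
# a documented format):  "Setting NAME[NODE]: OLD -> NEW"
ATTR_RE = re.compile(
    r"Setting (?P<name>\S+)\[(?P<node>[^\]]+)\]: (?P<old>.+?) -> (?P<new>\S+)\s*$"
)


def parse_attr_change(line):
    """Return name/node/old/new for an attrd 'Setting' line, else None."""
    m = ATTR_RE.search(line)
    return m.groupdict() if m else None


def is_fatal_failcount(change):
    """True if the change sets a fail-count attribute to INFINITY."""
    return change["name"].startswith("fail-count-") and change["new"] == "INFINITY"


def epoch_to_utc(value):
    """Convert a last-failure style epoch value to an aware UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


if __name__ == "__main__":
    lines = [
        "May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: "
        "Setting fail-count-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> INFINITY",
        "May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: "
        "Setting last-failure-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> 1620876566",
    ]
    for line in lines:
        change = parse_attr_change(line)
        if change is None:
            continue
        if is_fatal_failcount(change):
            print("fatal start failure:", change["name"], "on", change["node"])
        elif change["name"].startswith("last-failure-"):
            print("last failure at", epoch_to_utc(change["new"]).isoformat())
```

The last-failure value 1620876566 decodes to 2021-05-13 03:29:26 UTC,
which is 23:29:26 EDT on May 12 and so agrees with the syslog timestamp.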

It looks like you did a cleanup here:

May 14 14:19:30 an-a02n01.alteeve.com pacemaker-controld[5951]:  notice: Forcing the status of all resources to be redetected

It's hard to tell what happened after that without the detail log
(/var/log/pacemaker/pacemaker.log). The resource history should have
been wiped from the CIB, and probes of everything should have been
scheduled and executed. But I don't see any scheduler output, which is
odd.

Then we get a shutdown request, but the node has already left without
getting the OK to do so:

May 14 14:22:58 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: Setting shutdown[an-a02n02]: (unset) -> 1621016578
May 14 14:42:58 an-a02n01.alteeve.com pacemaker-controld[5951]:  warning: Stonith/shutdown of node an-a02n02 was not expected
May 14 14:42:58 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: Node an-a02n02 state is now lost

The log ends there, so I'm not sure what happens after that. I'd expect
this node to want to fence the other one. Since the fence devices have
failed, that can't happen, which could be why the node is unable to
shut itself down.
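
The shutdown request and the "state is now lost" message are exactly 20
minutes apart, which lines up with the 20 minutes reported for 'pcs
cluster stop'. That is also Pacemaker's default shutdown-escalation
value, so that timer may be what finally ended the wait; treat that as a
guess, not something these logs confirm. A minimal sketch of the
arithmetic, assuming the year 2021 (syslog timestamps don't carry one)
and using a hypothetical helper name:

```python
from datetime import datetime


def syslog_time(line, year=2021):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line.

    The year is an assumption taken from the date of this thread.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


requested = syslog_time(
    "May 14 14:22:58 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: "
    "Setting shutdown[an-a02n02]: (unset) -> 1621016578"
)
lost = syslog_time(
    "May 14 14:42:58 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: "
    "Node an-a02n02 state is now lost"
)

gap = lost - requested
print(gap)  # 0:20:00
```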

> - https://www.alteeve.com/files/an-a02n02.pacemaker_hang.2021-05-14.txt
> 
>   For example, it took 20 minutes for the 'pcs cluster stop' to
> complete. (Note that I tried restarting the pcsd daemon while waiting.)
> 
>   BTW, I see the errors about fence_delay metadata; that will be fixed
> and I don't believe it's related.
> 
>   Any advice on what happened, how to avoid it, and how to clean up
> without a full cluster restart, should it happen again?
>