[ClusterLabs] Problem with the cluster becoming mostly unresponsive

Digimer lists at alteeve.ca
Fri May 14 19:58:29 EDT 2021


On 2021-05-14 6:06 p.m., kgaillot at redhat.com wrote:
> On Fri, 2021-05-14 at 15:04 -0400, Digimer wrote:
>> Hi all,
>>
>>   I've run into an issue a couple of times now, and I'm not really
>> sure what's causing it. I've got a RHEL 8 cluster that, after a
>> while, will show one or more resources as 'FAILED'. When I try to do
>> a cleanup, it marks the resources as stopped, despite them still
>> running. After that, all attempts to manage the resources cause no
>> change. The pcs command seems to have no effect, and in some cases
>> refuses to return.
>>
>> The logs from the nodes (filtered for 'pcs' and 'pacem' since boot)
>> are here (resources running on node 2):
>>
>> - https://www.alteeve.com/files/an-a02n01.pacemaker_hang.2021-05-14.txt
> 
> The SNMP fence agent fails to start:
> 
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]:  warning: fence_apc_snmp[12842] stderr: [  ]
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]:  warning: fence_apc_snmp[12842] stderr: [  ]
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]:  warning: fence_apc_snmp[12842] stderr: [ 2021-05-12 23:29:25,955 ERROR: Please use '-h' for usage ]
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]:  warning: fence_apc_snmp[12842] stderr: [  ]
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-fenced[5947]:  notice: Operation 'monitor' [12842] for device 'apc_snmp_node2_an-pdu02' returned: -201 (Generic Pacemaker error)
> May 12 23:29:25 an-a02n01.alteeve.com pacemaker-controld[5951]:  notice: Result of start operation for apc_snmp_node2_an-pdu02 on an-a02n01: error

I noticed this, but I have no idea why it would have failed...
'fence_apc_snmp' is the bog-standard fence agent...
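
That "Please use '-h' for usage" in the stderr capture makes it look
like the agent was handed arguments it didn't understand. Next time it
fails I'll try running the agent by hand with roughly the same
parameters the stonith resource uses, something like this (the IP,
plug, and credentials below are placeholders, not my real values):

  # Ask the agent for its metadata, to confirm it parses args at all
  fence_apc_snmp -o metadata

  # Query the outlet status directly, bypassing pacemaker-fenced
  fence_apc_snmp --ip=10.201.2.2 --plug=2 \
      --username=admin --password=secret --action=status

If that works from a shell but fails under pacemaker-fenced, I'll diff
it against the attributes on the stonith resource ('pcs stonith
config').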

> which is fatal (because start-failure-is-fatal=true):
> 
> May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: Setting fail-count-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> INFINITY
> May 12 23:29:26 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: Setting last-failure-apc_snmp_node2_an-pdu01#start_0[an-a02n02]: (unset) -> 1620876566
> 
> That happens for both devices on both nodes, so they get stopped
> (successfully), which effectively disables them from being used, though
> I don't see them needed in these logs so it wouldn't matter.

So a monitor failure on the fence agent rendered the cluster effectively
unresponsive? How would I normally recover from this?
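
Just so I'm clear on the recovery path: is it essentially clearing the
failure history so the fence devices are eligible to start again?
This sketch is what I would have reached for (resource name taken from
my config):

  # See which resources have accumulated fail counts
  pcs resource failcount show

  # Wipe the failure history for the fence device so it can start again
  pcs resource cleanup apc_snmp_node2_an-pdu02

  # Or reset the fail count explicitly
  pcs resource failcount reset apc_snmp_node2_an-pdu02

I could also set 'pcs property set start-failure-is-fatal=false' so a
single start failure isn't immediately fatal, but I gather that trades
one set of problems for another.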

> It looks like you did a cleanup here:
> 
> May 14 14:19:30 an-a02n01.alteeve.com pacemaker-controld[5951]:  notice: Forcing the status of all resources to be redetected
> 
> It's hard to tell what happened after that without the detail log
> (/var/log/pacemaker/pacemaker.log). The resource history should have
> been wiped from the CIB, and probes of everything should have been
> scheduled and executed. But I don't see any scheduler output, which is
> odd.

Next time I start the cluster, I will truncate the pacemaker log. Then
if/when it fails again (seems to be happening regularly) I'll provide
the pacemaker.log file.
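
If it's more useful, I can also gather everything with crm_report once
it happens again, assuming the usual invocation applies (the
timestamps below are just an example window):

  # Collect logs, CIB, and cluster config from the nodes for the
  # window around the failure
  crm_report --from "2021-05-14 14:00" --to "2021-05-14 15:00" /tmp/pacemaker_hang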

> Then we get a shutdown request, but the node has already left without
> getting the OK to do so:
> 
> May 14 14:22:58 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: Setting shutdown[an-a02n02]: (unset) -> 1621016578
> May 14 14:42:58 an-a02n01.alteeve.com pacemaker-controld[5951]:  warning: Stonith/shutdown of node an-a02n02 was not expected
> May 14 14:42:58 an-a02n01.alteeve.com pacemaker-attrd[5949]:  notice: Node an-a02n02 state is now lost
> 
> The log ends there so I'm not sure what happens after that. I'd expect
> this node to want to fence the other one. Since the fence devices are
> failed, that can't happen, so that could be why the node is unable to
> shut down itself.
> 
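
Understood on the fencing angle. When it wedges again I'll check what
pacemaker-fenced thinks is registered and whether any fence actions
are pending before poking at it, along these lines:

  # List the fence devices pacemaker-fenced currently has registered
  stonith_admin --list-registered

  # Show the history of fence actions for all targets
  stonith_admin --history '*' --verbose
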
>> - https://www.alteeve.com/files/an-a02n02.pacemaker_hang.2021-05-14.txt
>>
>>   For example, it took 20 minutes for the 'pcs cluster stop' to
>> complete. (Note that I tried restarting the pcsd daemon while
>> waiting.)
>>
>>   BTW, I see the errors about fence_delay metadata; that will be
>> fixed, and I don't believe it's related.
>>
>>   Any advice on what happened, how to avoid it, and how to clean up
>> without a full cluster restart, should it happen again?
>>
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

