[ClusterLabs] pacemaker-controld getting respawned

Tue Jan 7 09:40:51 EST 2020

On 06/01/20 11:53 -0600, Ken Gaillot wrote:
> On Fri, 2020-01-03 at 13:23 +0000, S Sathish S wrote:
>> Pacemaker-controld process is getting restarted frequently reason for
>> failure disconnect from CIB/Internal Error (or) high cpu on the
>> system, same has been recorded in our system logs, Please find the
>> pacemaker and corosync version installed on the system.  
>>  
>> Kindly let us know why we are getting below error on the system.
>>  
>> corosync-2.4.4 à  https://github.com/corosync/corosync/tree/v2.4.4
>> pacemaker-2.0.2 à https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2

libqb version is missing (to be explained later on)

>> [root at vmc0621 ~]# ps -eo pid,lstart,cmd  | grep -iE
>> 'corosync|pacemaker' | grep -v grep
>> 2039 Wed Dec 25 15:56:15 2019 corosync
>> 3048 Wed Dec 25 15:56:15 2019 /usr/sbin/pacemakerd -f
>> 3101 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-based
>> 3102 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-fenced
>> 3103 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-execd
>> 3104 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-attrd
>> 3105 Wed Dec 25 15:56:15 2019 /usr/libexec/pacemaker/pacemaker-
>> schedulerd
>> 25371 Tue Dec 31 17:38:53 2019 /usr/libexec/pacemaker/pacemaker-
>> controld
>>  
>>  
>> In system message logs :
>>  
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4419 failed: Timer expired (-62)
>> Dec 30 10:02:37 vmc0621 pacemaker-controld[7517]: error: Node update 4420 failed: Timer expired (-62)
> 
> This means that the controller is not getting a response back from the
> CIB manager (pacemaker-based) within a reasonable time. If the DC can't
> record the status of nodes, it can't make correct decisions, so it has
> no choice but to exit (which should lead another node to fence it).

I am not sure if it would be feasible in this other, mutual daemon
relationship, but my first idea was that it might have something to
do with deadlock-prone arrangement of prioririties, akin to what
was resolved between pacemaker-fenced and pacemaker-based
(perhaps -based would be bombarding -controld with updates rather
than responding to some of its prior queries?) not too long ago:
https://github.com/ClusterLabs/pacemaker/commit/3401f25994e8cc059898550082f9b75f2d07f103

Satish, you haven't included any metrics of you cluster (nodes #,
resources #, load of the affected machine/all machines around the
problem occurrence), nor you provided wider excerpts of the log.

All in all, I'd start with updating libqb to 1.9.0 that supposedly
contains https://github.com/ClusterLabs/libqb/pull/352, fix for
the event priority concerning glitch, just in case.

> The default timeout is the number of active nodes in the cluster times
> 10 seconds, with a minimum of 30 seconds. That's a lot of time, so I
> would be concerned if the CIB isn't responsive for that long.
> 
> The logs from pacemaker-based before this point might be helpful,
> although if it's not getting scheduled any CPU time there wouldn't be
> any indication of that.
> 
> It is possible to set the timeout explicitly using the PCMK_cib_timeout
> environment variable, but the underlying problem would be likely to
> cause other issues.

-- 
Poki
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 819 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20200107/f20327f2/attachment.sig>