[ClusterLabs] Pacemaker occasionally takes minutes to respond

Wed May 24 09:04:02 EDT 2017

Hi Klaus,

Thank you for your response.
I tried many things, but no luck.

We have many Pacemaker clusters with 99% identical configurations and package versions, and only this one causes issues. (BTW, we use unicast for corosync, but that is the same on our other clusters as well.)
I checked all connection settings between the nodes (to confirm there are no firewall issues) and increased the number of cores on each node, but still, as long as a monitor operation is pending for a resource, no other operation is executed.

For example, if resource A is being monitored with a timeout of 90 seconds, I cannot do a cleanup or start/stop on any other resource until that check times out.
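
To put numbers on these delays, a small timing wrapper could help. This is only a sketch: time_command is a hypothetical helper name, not part of Pacemaker, and it just measures how long an arbitrary command takes to return.

```python
import subprocess
import time
from typing import List, Tuple


def time_command(argv: List[str], timeout: float = 600.0) -> Tuple[int, float]:
    """Run a command and return (exit code, elapsed seconds).

    Output is discarded; only the duration matters here.
    Raises subprocess.TimeoutExpired if the command runs longer than
    `timeout`, and FileNotFoundError if the executable does not exist.
    """
    start = time.monotonic()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return completed.returncode, time.monotonic() - start
```

For example, time_command(["crm", "resource", "cleanup", "RES_name"]) could be run once while a monitor is pending and once when the cluster is idle. The command returning does not necessarily mean the cluster has acted on it, so the timings would still need to be compared against the logs.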

Two more interesting things:
- The cluster recheck interval is set to 2 minutes, and even though the resources are running properly, the fail counters are not reduced and crm_mon keeps listing the resources in the failed actions section forever, or until I manually do a resource cleanup.
- If I execute a crm resource cleanup RES_name from another node, sometimes it simply does not clean up the failed state. If I execute it from the node where the resource IS actually running, the resource is removed from the failed actions.
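
A possible explanation for the first point, offered as an assumption to verify rather than a diagnosis: Pacemaker only expires failures for resources that have the failure-timeout meta attribute set (the default of 0 means failures never expire), and cluster-recheck-interval only bounds how soon an expiry is noticed. The sketch below is a simplified model of that rule, not Pacemaker's actual code; the function name and the numbers are made up for illustration.

```python
from typing import Optional


def latest_expiry_time(
    last_failure: float,
    failure_timeout: float,
    recheck_interval: float,
) -> Optional[float]:
    """Upper bound (in seconds) on when a failure would be treated as expired.

    Simplified model: a failure becomes eligible to expire once
    failure_timeout seconds have passed, and the expiry is noticed at the
    next policy engine run, which happens no later than one recheck
    interval afterwards. Returns None when failure_timeout is 0 or less,
    meaning the failure never expires on its own.
    """
    if failure_timeout <= 0:
        return None
    return last_failure + failure_timeout + recheck_interval
```

With failure-timeout unset the model returns None regardless of the recheck interval, which would match fail counters that stay until a manual cleanup.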

What do you recommend? How could I start troubleshooting these issues? As I said, this setup works fine on several other systems, but here I am really, really stuck.

Thanks!

Attila

> -----Original Message-----
> From: Klaus Wenninger [mailto:kwenning at redhat.com]
> Sent: Wednesday, May 10, 2017 2:04 PM
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
> 
> On 05/09/2017 10:34 PM, Attila Megyeri wrote:
> >
> > Actually I found some more details:
> >
> > there are two resources: A and B
> >
> > resource B depends on resource A (when the RA monitors B, it will fail
> > if A is not running properly)
> >
> > If I stop resource A, the next monitor operation of "B" will fail.
> > Interestingly, this check happens immediately after A is stopped.
> >
> > B is configured to restart if monitor fails. Start timeout is rather
> > long, 180 seconds. So pacemaker tries to restart B, and waits.
> >
> > If I want to start "A", nothing happens until the start operation of
> > "B" fails, which typically takes several minutes.
> >
> > Is this the right behavior?
> >
> > It appears that pacemaker is blocked while resource B is being
> > started, and I cannot really start its dependency...
> >
> > Shouldn't it be possible to start a resource while another resource is
> > also starting?
> >
> 
> As long as resources don't depend on each other, parallel starting should
> work/happen.
> 
> The number of parallel actions executed is derived from the number of
> cores, and when load is detected some kind of throttling kicks in (in
> fact a reduction of the operations executed in parallel, with the aim of
> reducing the load induced by pacemaker). When throttling kicks in you
> should get log messages (there is in fact a parallel discussion going
> on ...).
> No idea if throttling might be a reason here, but it is maybe worth
> considering at least.
> 
> Another reason I've observed why certain things happen with quite some
> delay is that some situations are apparently only resolved when the
> cluster-recheck-interval triggers a pengine run in addition to those
> triggered by changes.
> You might easily verify this by changing the cluster-recheck-interval.
> 
> Regards,
> Klaus
> 
> >
> > Thanks,
> >
> > Attila
> >
> > *From:*Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> > *Sent:* Tuesday, May 9, 2017 9:53 PM
> > *To:* users at clusterlabs.org; kgaillot at redhat.com
> > *Subject:* [ClusterLabs] Pacemaker occasionally takes minutes to respond
> >
> > Hi Ken, all,
> >
> > We ran into an issue very similar to the one described in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug]
> > Pacemaker occasionally takes minutes to respond
> >
> > But in our case we are not using fencing/stonith at all.
> >
> > Many times when I want to start/stop/cleanup a resource, it takes tens
> > of seconds (or even minutes) till the command gets executed. The logs
> > show nothing in that period, and the redundant rings show no fault.
> >
> > Could this be the same issue?
> >
> > Any hints on how to troubleshoot this?
> >
> > It is pacemaker 1.1.10, corosync 2.3.3
> >
> > Cheers,
> >
> > Attila
> >
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> 
> --
> Klaus Wenninger
> 
> Senior Software Engineer, EMEA ENG Openstack Infrastructure
> 
> Red Hat
> 
> kwenning at redhat.com