[ClusterLabs] Pacemaker occasionally takes minutes to respond

Wed May 10 08:04:10 EDT 2017

On 05/09/2017 10:34 PM, Attila Megyeri wrote:
>
> Actually I found some more details:
>
> There are two resources: A and B.
>
> Resource B depends on resource A (when the RA monitors B, it will fail
> if A is not running properly).
>
> If I stop resource A, the next monitor operation of "B" will fail.
> Interestingly, this check happens immediately after A is stopped.
>
> B is configured to restart if its monitor fails. The start timeout is
> rather long, 180 seconds. So pacemaker tries to restart B, and waits.
>
> If I want to start "A", nothing happens until the start operation of
> "B" fails, which typically takes several minutes.
>
> Is this the right behavior?
>
> It appears that pacemaker is blocked while resource B is starting,
> and I cannot really start its dependency…
>
> Shouldn't it be possible to start a resource while another resource is
> also starting?
>

As long as resources don't depend on each other, they should be started
in parallel.

The number of actions executed in parallel is derived from the number of
cores, and when high load is detected some throttling kicks in (i.e. the
number of operations executed in parallel is reduced, with the aim of
reducing the load induced by pacemaker). When throttling kicks in you
should get log messages (there is in fact a parallel discussion going on ...).
I have no idea whether throttling might be the reason here, but it is at
least worth considering.
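As a rough illustration of the idea only, the Python sketch below models a per-node limit on parallel actions that scales with the core count and is reduced under high load, so that extra actions sit in a queue. The constants ACTIONS_PER_CORE and HIGH_LOAD_PER_CORE, the halving rule, and the helpers `action_limit` and `schedule` are all hypothetical placeholders, not Pacemaker's actual values or algorithm.

```python
# Toy model of core-based parallelism with load throttling.
# Hypothetical constants: NOT Pacemaker's real values or algorithm.
ACTIONS_PER_CORE = 2
HIGH_LOAD_PER_CORE = 0.8


def action_limit(cores: int, load_avg: float) -> int:
    """Return how many actions this toy model runs in parallel.

    The base limit scales with the core count; when the load average per
    core exceeds HIGH_LOAD_PER_CORE, the limit is halved (never below 1).
    """
    cores = max(1, cores)
    limit = cores * ACTIONS_PER_CORE
    if load_avg / cores > HIGH_LOAD_PER_CORE:
        limit = max(1, limit // 2)
    return limit


def schedule(pending: list[str], limit: int) -> tuple[list[str], list[str]]:
    """Split pending actions into those started now and those left queued."""
    return pending[:limit], pending[limit:]


if __name__ == "__main__":
    # A single-core node under heavy load: only one action may run at once.
    limit = action_limit(cores=1, load_avg=1.5)
    running, queued = schedule(["start B", "start A"], limit)
    print(limit, running, queued)
```

In that example the start of A stays queued behind the long-running start of B. This is only one way throttling could produce a delay of the kind described; whether it applies to a given cluster would have to be confirmed from the throttling log messages.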

Another reason I've observed for certain things happening with quite some
delay is that some situations apparently only get resolved when the
cluster-recheck-interval triggers a pengine run, in addition to the runs
triggered by changes. You can easily verify this by changing the
cluster-recheck-interval.
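A toy calculation can make this effect concrete. If a situation is only picked up by the timer-driven pengine run, the extra wait is the time until the next recheck tick, so the worst case is roughly one full cluster-recheck-interval. The Python sketch below assumes timer runs happen at fixed multiples of the interval after some last run; `next_recheck` and `recheck_delay` are hypothetical helpers, not part of Pacemaker, and the interval values are examples only.

```python
import math


def next_recheck(t: float, last_recheck: float, interval: float) -> float:
    """Return the time of the first timer-driven scheduler run strictly after t.

    Assumes (for this toy model) that timer runs happen at
    last_recheck + k * interval for k = 1, 2, ...
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    periods = math.floor((t - last_recheck) / interval) + 1
    return last_recheck + periods * interval


def recheck_delay(t: float, last_recheck: float, interval: float) -> float:
    """Extra wait for a situation that only a timer-driven run resolves."""
    return next_recheck(t, last_recheck, interval) - t


if __name__ == "__main__":
    # Example values only: a situation arising 100 s after the last recheck.
    for interval in (900, 60):
        print(interval, recheck_delay(100, 0, interval))
```

With a 900-second interval, a situation arising 100 seconds after the last recheck waits 800 more seconds; with a 60-second interval it waits 20. If shortening the interval shortens the observed delays, that points to this cause.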

Regards,
Klaus

>
> Thanks,
> Attila
>
> *From:* Attila Megyeri [mailto:amegyeri at minerva-soft.com]
> *Sent:* Tuesday, May 9, 2017 9:53 PM
> *To:* users at clusterlabs.org; kgaillot at redhat.com
> *Subject:* [ClusterLabs] Pacemaker occasionally takes minutes to respond
>
> Hi Ken, all,
>
> We ran into an issue very similar to the one described in
> https://bugzilla.redhat.com/show_bug.cgi?id=1430112 ([Intel 7.4 Bug]
> Pacemaker occasionally takes minutes to respond).
>
> But in our case we are not using fencing/stonith at all.
>
> Many times when I want to start/stop/cleanup a resource, it takes tens
> of seconds (or even minutes) until the command gets executed. The logs
> show nothing in that period, and the redundant rings show no fault.
>
> Could this be the same issue?
>
> Any hints on how to troubleshoot this?
>
> It is pacemaker 1.1.10, corosync 2.3.3.
>
> Cheers,
> Attila
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Openstack Infrastructure

Red Hat

kwenning at redhat.com