[ClusterLabs] Pacemaker resource start delay when another resource is starting

Fri Oct 27 12:18:25 EDT 2017

Hi,

If I remember correctly, any pending actions from a previous transition
must be completed before a new transition can be calculated. Otherwise,
there's the possibility that the pending action could change the state
in a way that makes the second transition's decisions harmful.

Theoretically (and ideally), pacemaker could figure out whether some of
the actions in the second transition would be needed regardless of
whether the pending actions succeeded or failed, but in practice, that
would be difficult to implement (and possibly take more time to
calculate than is desirable in a recovery situation).
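
To make the effect concrete, here is a rough toy model in Python. It is
only a sketch of the rule described above, not Pacemaker's actual
scheduler code: the names Failure and schedule_recoveries are made up
for illustration, and the 5-second logserver start time is an assumed
value. The 60-second fm_mgt start comes from the report below.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    resource: str
    detected_at: int     # seconds: when the monitor reports the failure
    start_duration: int  # seconds: how long the recovery start action runs


def schedule_recoveries(failures):
    """Toy model: a new transition is calculated only after every action
    of the previous transition has completed.

    Returns {resource: (start_begin, start_end)} in seconds.
    """
    pending = sorted(failures, key=lambda f: f.detected_at)
    timeline = {}
    now = 0
    while pending:
        # The next transition is calculated once the previous one is done
        # and at least one failure is known.
        now = max(now, pending[0].detected_at)
        batch = [f for f in pending if f.detected_at <= now]
        pending = [f for f in pending if f.detected_at > now]
        # Actions within one transition run in parallel in this model.
        for f in batch:
            timeline[f.resource] = (now, now + f.start_duration)
        # The transition ends when its slowest action completes.
        now = max(end for _, end in (timeline[f.resource] for f in batch))
    return timeline


if __name__ == "__main__":
    result = schedule_recoveries([
        Failure("fm_mgt", detected_at=0, start_duration=60),
        Failure("logserver", detected_at=20, start_duration=5),  # assumed 5s
    ])
    for name, (begin, end) in result.items():
        print(f"{name}: start runs from t={begin}s to t={end}s")
```

In this model the logserver failure is seen at t=20s, but its start is
not scheduled until the fm_mgt start finishes at t=60s, which matches
the delay described in the report.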

On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
> I have two clone resources in my corosync/pacemaker cluster: fm_mgt
> and logserver. Both use OCF resource agents. fm_mgt takes 1 minute to
> start the service (its OCF start function runs for 1 minute). They
> are configured as below:
> # crm configure show
> node 168002177: 192.168.2.177
> node 168002178: 192.168.2.178
> node 168002179: 192.168.2.179
> primitive fm_mgt fm_mgt \
>         op monitor interval=20s timeout=120s \
>         op stop interval=0 timeout=120s on-fail=restart \
>         op start interval=0 timeout=120s on-fail=restart \
>         meta target-role=Started
> primitive logserver logserver \
>         op monitor interval=20s timeout=120s \
>         op stop interval=0 timeout=120s on-fail=restart \
>         op start interval=0 timeout=120s on-fail=restart \
>         meta target-role=Started
> clone fm_mgt_replica fm_mgt
> clone logserver_replica logserver
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.13-10.el7-44eb2dd \
>         cluster-infrastructure=corosync \
>         stonith-enabled=false \
>         start-failure-is-fatal=false
> When I kill the fm_mgt service on one node, pacemaker recovers it
> immediately after the monitor fails. This looks perfectly normal. But
> if I kill the logserver service on any node during the 1 minute that
> fm_mgt is starting, the monitor catches the failure normally too, yet
> pacemaker does not restart logserver immediately. It waits until
> fm_mgt has finished starting and only then begins restarting
> logserver. It seems that there is some dependency between pacemaker
> resources.
> # crm status
> Last updated: Thu Oct 26 06:40:24 2017          Last change: Thu Oct 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
> Stack: corosync
> Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
> 3 nodes and 6 resources configured
> Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
> Full list of resources:
>  Clone Set: logserver_replica [logserver]
>      logserver  (ocf::heartbeat:logserver):     FAILED 192.168.2.177
>      Started: [ 192.168.2.178 192.168.2.179 ]
>  Clone Set: fm_mgt_replica [fm_mgt]
>      Started: [ 192.168.2.178 192.168.2.179 ]
>      Stopped: [ 192.168.2.177 ]
> I am very confused. Is there something wrong with my configuration?
> Thank you very much!
> Best regards,
> James
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot <kgaillot at redhat.com>