[ClusterLabs] Pacemaker resource start delay when another resource is starting

Ken Gaillot kgaillot at redhat.com
Wed Nov 1 10:20:50 EDT 2017


On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
> 
> Thank you for your response! This means that there shouldn't be a long
> "sleep" in the ocf script.
> If my service takes 10 minutes from service start until the health
> check passes, then what should I do?

That is a tough situation with no great answer.

You can leave it as it is, and live with the delay. Note that it only
happens if a resource fails after the slow resource has already begun
starting ... if they fail at the same time (as with a node failure),
the cluster will schedule recovery for both at the same time.

Another possibility would be to have the start return immediately, and
make the monitor artificially return success for the first 10 minutes
after starting. It's hacky, and it depends on your situation whether
the behavior is acceptable. My first thought on how to implement this
would be to have the start action set a private node attribute
(attrd_updater -p) with a timestamp. When the monitor runs, it could do
its usual check, and if it succeeds, remove that node attribute, but if
it fails, check the node attribute to see whether it's within the
desired delay.
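
To make that concrete, here is a minimal sketch of the idea in OCF-agent
style shell. The attribute name, the 600-second grace window, and the
do_health_check stub are all assumptions for illustration; a real agent
would use the OCF exit-code variables and its own health check:

```shell
# Grace period (seconds) during which monitor failures are masked after
# a start. 600 is an assumed value -- tune it to your service.
GRACE=600
ATTR="fm_mgt_start_epoch"   # hypothetical private attribute name

# Pure helper: is "now" still within GRACE seconds of "start"?
within_grace() {
    start="$1"; now="$2"
    [ $(( now - start )) -lt "$GRACE" ]
}

fm_mgt_start() {
    # Record the start time as a private (-p) node attribute, so it is
    # kept in the attribute daemon only and never written to the CIB.
    attrd_updater -p -n "$ATTR" -U "$(date +%s)"
    # ... launch the slow service here, then return immediately ...
    return 0   # $OCF_SUCCESS
}

fm_mgt_monitor() {
    if do_health_check; then
        # Healthy: drop the grace marker so later failures are real.
        attrd_updater -p -n "$ATTR" -D
        return 0   # $OCF_SUCCESS
    fi
    # Unhealthy: report success anyway if still inside the window.
    start=$(attrd_updater -p -n "$ATTR" -Q -q 2>/dev/null)
    if [ -n "$start" ] && within_grace "$start" "$(date +%s)"; then
        return 0   # $OCF_SUCCESS (failure masked during grace period)
    fi
    return 7       # $OCF_NOT_RUNNING
}
```

The caveat mentioned above applies: during the grace window the cluster
believes the resource is healthy, so a genuine early failure goes
unnoticed until the window expires.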

> Thank you very much!
>  
> > Hi,
> > If I remember correctly, any pending actions from a previous
> > transition must be completed before a new transition can be
> > calculated. Otherwise, there's the possibility that the pending
> > action could change the state in a way that makes the second
> > transition's decisions harmful.
> > Theoretically (and ideally), pacemaker could figure out whether
> > some of the actions in the second transition would be needed
> > regardless of whether the pending actions succeeded or failed, but
> > in practice, that would be difficult to implement (and possibly
> > take more time to calculate than is desirable in a recovery
> > situation).
>  
> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
> 
> > I have two clone resources in my corosync/pacemaker cluster. They
> > are fm_mgt and logserver. Both of their RAs are ocf. fm_mgt takes 1
> > minute to start the service (the ocf start function runs for 1
> > minute). Configured as below:
> > # crm configure show
> > node 168002177: 192.168.2.177
> > node 168002178: 192.168.2.178
> > node 168002179: 192.168.2.179
> > primitive fm_mgt fm_mgt \
> >         op monitor interval=20s timeout=120s \
> >         op stop interval=0 timeout=120s on-fail=restart \
> >         op start interval=0 timeout=120s on-fail=restart \
> >         meta target-role=Started
> > primitive logserver logserver \
> >         op monitor interval=20s timeout=120s \
> >         op stop interval=0 timeout=120s on-fail=restart \
> >         op start interval=0 timeout=120s on-fail=restart \
> >         meta target-role=Started
> > clone fm_mgt_replica fm_mgt
> > clone logserver_replica logserver
> > property cib-bootstrap-options: \
> >         have-watchdog=false \
> >         dc-version=1.1.13-10.el7-44eb2dd \
> >         cluster-infrastructure=corosync \
> >         stonith-enabled=false \
> >         start-failure-is-fatal=false
> > When I kill the fm_mgt service on one node, pacemaker will
> > immediately recover it after the monitor fails. This looks
> > perfectly normal. But during this 1 minute of fm_mgt starting, if I
> > kill the logserver service on any node, the monitor will catch the
> > failure normally too, but pacemaker will not restart it
> > immediately; it waits for the fm_mgt start to finish. After the
> > fm_mgt start finishes, pacemaker begins restarting logserver. It
> > seems that there is some dependency between pacemaker resources.
> > # crm status
> > Last updated: Thu Oct 26 06:40:24 2017          Last change: Thu Oct 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
> > Stack: corosync
> > Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
> > 3 nodes and 6 resources configured
> > Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
> > Full list of resources:
> >  Clone Set: logserver_replica [logserver]
> >      logserver  (ocf::heartbeat:logserver):     FAILED 192.168.2.177
> >      Started: [ 192.168.2.178 192.168.2.179 ]
> >  Clone Set: fm_mgt_replica [fm_mgt]
> >      Started: [ 192.168.2.178 192.168.2.179 ]
> >      Stopped: [ 192.168.2.177 ]
> > I am very confused. Is there something wrong with my configuration?
> > Thank you very much!
> > James
> > best regards
>  
> 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>
