[ClusterLabs] Pacemaker resource start delay when another resource is starting

Klaus Wenninger kwenning at redhat.com
Mon Nov 6 11:22:37 CET 2017


Hi!

I'm not saying that using start-delay on the monitor operation is
a good thing. In most cases it is definitely better to delay the
return of start until a monitor would succeed. I have seen discussion
about deprecating start-delay - I don't know its current state though.
But this case - if I got the use-case right - with a 10-minute delay might
be a legitimate use of start-delay - if any exists at all ;-)
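Concretely, that start-delay would sit on the monitor operation of the slow
resource. A sketch only, reusing the fm_mgt primitive from the thread and
turning the 10-minute figure into start-delay=600s (an illustrative value,
not a tested configuration):

```shell
# Sketch: delay the first recurring monitor of the slow resource by 10
# minutes, so its warm-up phase is not reported as a failure.
crm configure primitive fm_mgt fm_mgt \
        op start interval=0 timeout=120s on-fail=restart \
        op monitor interval=20s timeout=120s start-delay=600s \
        op stop interval=0 timeout=120s on-fail=restart
```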

Regards,
Klaus

On 11/04/2017 03:46 PM, lkxjtu wrote:
>
>
> >Another possibility would be to have the start return immediately, and
> >make the monitor artificially return success for the first 10 minutes
> >after starting. It's hacky, and it depends on your situation whether
> >the behavior is acceptable.
> I tried moving the sleep into the monitor function (I added a "sleep
> 60" at the monitor entry for debugging), so the start function returns
> immediately. I found something interesting: the first monitor after
> start blocks other resources too, but from the second run onward it no
> longer blocks them! Is this normal?
>
> >My first thought on how to implement this
> >would be to have the start action set a private node attribute
> >(attrd_updater -p) with a timestamp. When the monitor runs, it could do
> >its usual check, and if it succeeds, remove that node attribute, but if
> >it fails, check the node attribute to see whether it's within the
> >desired delay.
> Does this mean that if it is within the desired delay, the monitor
> should return success even if the health check failed?
> I think this could solve my problem, except for what "crm status" shows
> in the meantime.
>
>
> At 2017-11-01 21:20:50, "Ken Gaillot" <kgaillot at redhat.com> wrote:
> >On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
> >> 
> >> Thank you for your response! This means that there shouldn't be a
> >> long "sleep" in an OCF script.
> >> If my service takes 10 minutes from starting until its health check
> >> passes, then what should I do?
> >
> >That is a tough situation with no great answer.
> >
> >You can leave it as it is, and live with the delay. Note that it only
> >happens if a resource fails after the slow resource has already begun
> >starting ... if they fail at the same time (as with a node failure),
> >the cluster will schedule recovery for both at the same time.
> >
> >Another possibility would be to have the start return immediately, and
> >make the monitor artificially return success for the first 10 minutes
> >after starting. It's hacky, and it depends on your situation whether
> >the behavior is acceptable. My first thought on how to implement this
> >would be to have the start action set a private node attribute
> >(attrd_updater -p) with a timestamp. When the monitor runs, it could do
> >its usual check, and if it succeeds, remove that node attribute, but if
> >it fails, check the node attribute to see whether it's within the
> >desired delay.
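Ken's attrd_updater idea might be sketched roughly like this inside the
agent. The attribute name "fm_mgt-start-time", the 600-second grace window,
and the helpers start_the_service / real_health_check are illustrative
assumptions, not tested agent code:

```shell
#!/bin/sh
# Grace-period trick: start records a timestamp in a private node
# attribute; monitor pretends success while still inside the window.
GRACE=600                       # assumed 10-minute warm-up window
ATTR="fm_mgt-start-time"        # assumed attribute name

# Pure helper: has less than GRACE seconds elapsed since $1 at time $2?
within_grace() {
    [ $(( $2 - $1 )) -lt "$GRACE" ]
}

agent_start() {
    # Record the start time as a private (-p) node attribute.
    attrd_updater -p -n "$ATTR" -U "$(date +%s)"
    start_the_service           # placeholder for the real start logic
    return 0                    # OCF_SUCCESS
}

agent_monitor() {
    if real_health_check; then
        attrd_updater -D -n "$ATTR"   # healthy: clear the grace marker
        return 0                      # OCF_SUCCESS
    fi
    # Unhealthy: report success anyway while still inside the window.
    start_ts=$(attrd_updater -Q -n "$ATTR" 2>/dev/null |
               sed -n 's/.*value="\([0-9]*\)".*/\1/p')
    if [ -n "$start_ts" ] && within_grace "$start_ts" "$(date +%s)"; then
        return 0                      # OCF_SUCCESS (artificial)
    fi
    return 7                          # OCF_NOT_RUNNING
}
```

The catch, as the reply above notes, is that "crm status" will show the
resource as started during the grace window even while the health check is
failing.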
> >
> >> Thank you very much!
> >>  
> >> > Hi,
> >> > If I remember correctly, any pending actions from a previous
> >> > transition must be completed before a new transition can be
> >> > calculated. Otherwise, there's the possibility that the pending
> >> > action could change the state in a way that makes the second
> >> > transition's decisions harmful.
> >> > Theoretically (and ideally), pacemaker could figure out whether
> >> > some of the actions in the second transition would be needed
> >> > regardless of whether the pending actions succeeded or failed, but
> >> > in practice, that would be difficult to implement (and possibly
> >> > take more time to calculate than is desirable in a recovery
> >> > situation).
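For what it's worth, the scheduler's current view of the live cluster,
including the actions it plans in the current transition, can be inspected
with crm_simulate on a cluster node (a diagnostic sketch; output depends on
the cluster state):

```shell
# Show what the scheduler would do right now against the live CIB,
# i.e. which actions the current transition contains.
crm_simulate --live-check
```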
> >>  
> >> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
> >> 
> >> > I have two clone resources in my corosync/pacemaker cluster:
> >> > fm_mgt and logserver. Both of their RAs are OCF. fm_mgt takes 1
> >> > minute to start the service (its OCF start function runs for 1
> >> > minute). Configured as below:
> >> > # crm configure show
> >> > node 168002177: 192.168.2.177
> >> > node 168002178: 192.168.2.178
> >> > node 168002179: 192.168.2.179
> >> > primitive fm_mgt fm_mgt \
> >> >         op monitor interval=20s timeout=120s \
> >> >         op stop interval=0 timeout=120s on-fail=restart \
> >> >         op start interval=0 timeout=120s on-fail=restart \
> >> >         meta target-role=Started
> >> > primitive logserver logserver \
> >> >         op monitor interval=20s timeout=120s \
> >> >         op stop interval=0 timeout=120s on-fail=restart \
> >> >         op start interval=0 timeout=120s on-fail=restart \
> >> >         meta target-role=Started
> >> > clone fm_mgt_replica fm_mgt
> >> > clone logserver_replica logserver
> >> > property cib-bootstrap-options: \
> >> >         have-watchdog=false \
> >> >         dc-version=1.1.13-10.el7-44eb2dd \
> >> >         cluster-infrastructure=corosync \
> >> >         stonith-enabled=false \
> >> >         start-failure-is-fatal=false
> >> > When I kill the fm_mgt service on one node, pacemaker immediately
> >> > recovers it after the monitor fails. This looks perfectly normal.
> >> > But during the 1 minute while fm_mgt is starting, if I kill the
> >> > logserver service on any node, the monitor catches the failure
> >> > normally too, but pacemaker does not restart it immediately;
> >> > instead it waits until fm_mgt has finished starting, and only then
> >> > begins restarting logserver. It seems there is some dependency
> >> > between pacemaker resources.
> >> > # crm status
> >> > Last updated: Thu Oct 26 06:40:24 2017
> >> > Last change: Thu Oct 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
> >> > Stack: corosync
> >> > Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition with quorum
> >> > 3 nodes and 6 resources configured
> >> > Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
> >> > Full list of resources:
> >> >  Clone Set: logserver_replica [logserver]
> >> >      logserver  (ocf::heartbeat:logserver):     FAILED 192.168.2.177
> >> >      Started: [ 192.168.2.178 192.168.2.179 ]
> >> >  Clone Set: fm_mgt_replica [fm_mgt]
> >> >      Started: [ 192.168.2.178 192.168.2.179 ]
> >> >      Stopped: [ 192.168.2.177 ]
> >> > I am very confused. Is there something wrong with my
> >> > configuration? Thank you very much!
> >> > James
> >> > best regards
> >>  
> >> 
> >> 
> >-- 
> >Ken Gaillot <kgaillot at redhat.com>
>
>
>  
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



