[ClusterLabs] systemd RA start/stop delays

Tue Aug 30 10:18:06 UTC 2016

Hi,

On Thu, Aug 18, 2016 at 09:00:24AM -0500, Ken Gaillot wrote:
> On 08/17/2016 08:17 PM, TEG AMJG wrote:
> > Hi
> > 
> > I am having a problem with a simple Active/Passive cluster which
> > consists in the next configuration
> > 
> > Cluster Name: kamcluster
> > Corosync Nodes:
> >  kam1vs3 kam2vs3
> > Pacemaker Nodes:
> >  kam1vs3 kam2vs3
> > 
> > Resources:
> >  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> >   Attributes: ip=10.0.1.206 cidr_netmask=32
> >   Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
> >               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
> >               monitor interval=10s (ClusterIP-monitor-interval-10s)
> >  Resource: ClusterIP2 (class=ocf provider=heartbeat type=IPaddr2)
> >   Attributes: ip=10.0.1.207 cidr_netmask=32
> >   Operations: start interval=0s timeout=20s (ClusterIP2-start-interval-0s)
> >               stop interval=0s timeout=20s (ClusterIP2-stop-interval-0s)
> >               monitor interval=10s (ClusterIP2-monitor-interval-10s)
> >  Resource: rtpproxycluster (class=systemd type=rtpproxy)
> >   Operations: monitor interval=10s (rtpproxycluster-monitor-interval-10s)
> >               stop interval=0s on-fail=block
> > (rtpproxycluster-stop-interval-0s)
> >  Resource: kamailioetcfs (class=ocf provider=heartbeat type=Filesystem)
> >   Attributes: device=/dev/drbd1 directory=/etc/kamailio fstype=ext4
> >   Operations: start interval=0s timeout=60 (kamailioetcfs-start-interval-0s)
> >               monitor interval=10s on-fail=fence
> > (kamailioetcfs-monitor-interval-1                                      
> >                                                0s)
> >               stop interval=0s on-fail=fence
> > (kamailioetcfs-stop-interval-0s)
> >  Clone: fence_kam2_xvm-clone
> >   Meta Attrs: interleave=true clone-max=2 clone-node-max=1
> >   Resource: fence_kam2_xvm (class=stonith type=fence_xvm)
> >    Attributes: port=tegamjg_kam2 pcmk_host_list=kam2vs3
> >    Operations: monitor interval=60s (fence_kam2_xvm-monitor-interval-60s)
> >  Master: kamailioetcclone
> >   Meta Attrs: master-max=1 master-node-max=1 clone-max=2
> > clone-node-max=1 notify=t                                              
> >                                        rue on-fail=fence
> >   Resource: kamailioetc (class=ocf provider=linbit type=drbd)
> >    Attributes: drbd_resource=kamailioetc
> >    Operations: start interval=0s timeout=240 (kamailioetc-start-interval-0s)
> >                promote interval=0s on-fail=fence
> > (kamailioetc-promote-interval-0s)
> >                demote interval=0s on-fail=fence
> > (kamailioetc-demote-interval-0s)
> >                stop interval=0s on-fail=fence (kamailioetc-stop-interval-0s)
> >                monitor interval=10s (kamailioetc-monitor-interval-10s)
> >  Clone: fence_kam1_xvm-clone
> >   Meta Attrs: interleave=true clone-max=2 clone-node-max=1
> >   Resource: fence_kam1_xvm (class=stonith type=fence_xvm)
> >    Attributes: port=tegamjg_kam1 pcmk_host_list=kam1vs3
> >    Operations: monitor interval=60s (fence_kam1_xvm-monitor-interval-60s)
> >  Resource: kamailiocluster (class=ocf provider=heartbeat type=kamailio)
> >   Attributes: listen_address=10.0.1.206
> > conffile=/etc/kamailio/kamailio.cfg pidfil                              
> >                                                      
> >  e=/var/run/kamailio.pid monitoring_ip=10.0.1.206
> > monitoring_ip2=10.0.1.207 port=50                                      
> >                                                60 proto=udp
> > kamctlrc=/etc/kamailio/kamctlrc shmem=128 pkg=8
> >   Meta Attrs: target-role=Stopped
> >   Operations: start interval=0s timeout=60
> > (kamailiocluster-start-interval-0s)
> >               stop interval=0s timeout=30 (kamailiocluster-stop-interval-0s)
> >               monitor interval=5s (kamailiocluster-monitor-interval-5s)
> > 
> > Stonith Devices:
> > Fencing Levels:
> > 
> > Location Constraints:
> > Ordering Constraints:
> >   start fence_kam1_xvm-clone then start fence_kam2_xvm-clone
> > (kind:Mandatory) (id:                                                  
> >                                  
> >  order-fence_kam1_xvm-clone-fence_kam2_xvm-clone-mandatory)
> >   start fence_kam2_xvm-clone then promote kamailioetcclone
> > (kind:Mandatory) (id:or                                                
> >                                    
> >  der-fence_kam2_xvm-clone-kamailioetcclone-mandatory)
> >   promote kamailioetcclone then start kamailioetcfs (kind:Optional)
> > (id:order-kama                                                          
> >                            ilioetcclone-kamailioetcfs-Optional)
> >   Resource Sets:
> >     set kamailioetcfs sequential=true (id:pcs_rsc_set_kamailioetcfs) set
> > ClusterIP                                                              
> >                         ClusterIP2 sequential=false
> > (id:pcs_rsc_set_ClusterIP_ClusterIP2) set rtpproxyclu                  
> >                                                                    ster
> > kamailiocluster sequential=true
> > (id:pcs_rsc_set_rtpproxycluster_kamailioclust                          
> >                                                            er)
> > (id:pcs_rsc_order_set_kamailioetcfs_set_ClusterIP_ClusterIP2_set_rtpproxyclust
> >                                                                        
> >              er_kamailiocluster)
> > Colocation Constraints:
> >   rtpproxycluster with ClusterIP2 (score:INFINITY)
> > (id:colocation-rtpproxycluster-                                        
> >                                              ClusterIP2-INFINITY)
> >   ClusterIP2 with ClusterIP (score:INFINITY)
> > (id:colocation-ClusterIP2-ClusterIP-I                                  
> >                                                    NFINITY)
> >   ClusterIP with kamailioetcfs (score:INFINITY)
> > (id:colocation-ClusterIP-kamailioe                                      
> >                                                tcfs-INFINITY)
> >   kamailioetcfs with kamailioetcclone (score:INFINITY)
> > (with-rsc-role:Master) (id:                                            
> >                                        
> >  colocation-kamailioetcfs-kamailioetcclone-INFINITY)
> >   fence_kam2_xvm-clone with fence_kam1_xvm-clone (score:INFINITY)
> > (id:colocation-f                                                        
> >                            
> >  ence_kam2_xvm-clone-fence_kam1_xvm-clone-INFINITY)
> >   kamailioetcclone with fence_kam2_xvm-clone (score:INFINITY)
> > (id:colocation-kamai                                                    
> >                                  lioetcclone-fence_kam2_xvm-clone-INFINITY)
> >   kamailiocluster with rtpproxycluster (score:INFINITY)
> > (id:colocation-kamailioclu                                              
> >                                        ster-rtpproxycluster-INFINITY)
> > 
> > Resources Defaults:
> >  migration-threshold: 2
> >  failure-timeout: 10m
> >  resource-stickiness: 200
> > Operations Defaults:
> >  No defaults set
> > 
> > Cluster Properties:
> >  cluster-infrastructure: corosync
> >  cluster-name: kamcluster
> >  dc-version: 1.1.13-10.el7_2.2-44eb2dd
> >  have-watchdog: false
> >  last-lrm-refresh: 1471479940
> >  no-quorum-policy: ignore
> >  start-failure-is-fatal: false
> >  stonith-action: reboot
> >  stonith-enabled: true
> > 
> > Now my problem is when the rtpproxy systemd resource starts/stops, i am
> > experiencing 2 seconds delay between the call from the start/stop signal
> > is sent to systemd and the confirmation and interpretation of the action
> > to the crmd. I am pretty sure that the initiation of this service doesnt
> 
> The problem is that systemd does not have a way to say "start this
> service, and don't tell me it's done until it's completely started".
> Systemd reports successful completion as soon as the start has been
> initiated.

The same holds for LSB init scripts. However, pacemaker didn't
make any effort to check whether the service (typically some kind
of daemon) was actually started/stopped.

> By contrast, Pacemaker can't proceed with dependencies until the service
> is completely up. So, Pacemaker's workaround is to initiate the start,
> then poll the status every 2 seconds until systemd reports the service
> is completely up.
> 
> The 2 seconds is hardcoded, so there's no way to avoid that at the
> moment, other than using an OCF agent.

So, is this only for systemd resources or also LSB? Do you know
when the 2 second delay got introduced?

Thanks,

Dejan

> > take even half of that time to start/stop. i did the same thing with a
> > dummy service  as a systemd resource and it als takes 2 seconds to
> > start/stop the resource. Also, this is not happening in any of the ocf
> > resources. I am putting what i see in the logs about this behaviour
> > 
> > This is after doing "pcs resource restart rtpproxycluster"
> > 
> > Aug 17 20:59:18 [13187] kam1       crmd:   notice: te_rsc_command:    
> >  Initiating action 14: stop rtpproxycluster_stop_0 on kam1vs3 (local)
> > Aug 17 20:59:18 [13184] kam1       lrmd:     info:
> > cancel_recurring_action:     Cancelling systemd operation
> > rtpproxycluster_status_10000
> > Aug 17 20:59:18 [13187] kam1       crmd:     info: do_lrm_rsc_op:      
> > Performing key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
> > op=rtpproxycluster_stop_0
> > Aug 17 20:59:18 [13184] kam1       lrmd:     info: log_execute:
> > executing - rsc:rtpproxycluster action:stop call_id:106
> > Aug 17 20:59:18 [13187] kam1       crmd:     info: process_lrm_event:  
> > Operation rtpproxycluster_monitor_10000: Cancelled (node=kam1vs3,
> > call=104, confirmed=true)
> > Aug 17 20:59:18 [13184] kam1       lrmd:     info: systemd_exec_result:
> > Call to stop passed: /org/freedesktop/systemd1/job/8302
> > Aug 17 20:59:20 [13187] kam1       crmd:   notice: process_lrm_event:  
> > Operation rtpproxycluster_stop_0: ok (node=kam1vs3, call=106, rc=0,
> > cib-update=134, confirmed=true)
> > Aug 17 20:59:20 [13182] kam1        cib:     info: cib_perform_op:    
> >  +
> >  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
> >  @operation_key=rtpproxycluster_stop_0, @operation=stop,
> > @transition-key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> > @transition-magic=0:0;14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> > @call-id=106, @last-run=1471481958, @last-rc-change=1471481958,
> > @exec-time=2116
> > 
> > Aug 17 20:59:20 [13186] kam1    pengine:     info: RecurringOp:  Start
> > recurring monitor (10s) for rtpproxycluster on kam1vs3
> > Aug 17 20:59:20 [13186] kam1    pengine:   notice: LogActions:  Start  
> > rtpproxycluster (kam1vs3)
> > Aug 17 20:59:20 [13187] kam1       crmd:   notice: te_rsc_command:    
> >  Initiating action 13: start rtpproxycluster_start_0 on kam1vs3 (local)
> > Aug 17 20:59:20 [13187] kam1       crmd:     info: do_lrm_rsc_op:      
> > Performing key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
> > op=rtpproxycluster_start_0
> > Aug 17 20:59:20 [13184] kam1       lrmd:     info: log_execute:
> > executing - rsc:rtpproxycluster action:start call_id:107
> > Aug 17 20:59:21 [13184] kam1       lrmd:     info: systemd_exec_result:
> > Call to start passed: /org/freedesktop/systemd1/job/8303
> > Aug 17 20:59:23 [13187] kam1       crmd:   notice: process_lrm_event:  
> > Operation rtpproxycluster_start_0: ok (node=kam1vs3, call=107, rc=0,
> > cib-update=136, confirmed=true)
> > Aug 17 20:59:23 [13182] kam1        cib:     info: cib_perform_op:    
> >  +
> >  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
> >  @operation_key=rtpproxycluster_start_0, @operation=start,
> > @transition-key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> > @transition-magic=0:0;13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> > @call-id=107, @last-run=1471481960, @last-rc-change=1471481960,
> > @exec-time=2068
> > 
> > Why is this happening?, many of my resources depends on each other, so
> > the delay it takes when failover its needed is quite a lot (just 4
> > seconds of delay stopping/starting just for that systemd resource.
> > 
> > Is this normal?, is there anyway to solve this?, Do i really need to
> > make an OCF resource for that service (that would take some time that i
> > dont really have at this moment)
> > 
> > Regards
> > 
> > Alejandro
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org