[ClusterLabs] systemd RA start/stop delays
Klaus Wenninger
kwenning at redhat.com
Thu Aug 18 06:46:53 UTC 2016
On 08/18/2016 03:17 AM, TEG AMJG wrote:
> Hi
>
> I am having a problem with a simple Active/Passive cluster which
> consists in the next configuration
>
> Cluster Name: kamcluster
> Corosync Nodes:
> kam1vs3 kam2vs3
> Pacemaker Nodes:
> kam1vs3 kam2vs3
>
> Resources:
> Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> Attributes: ip=10.0.1.206 cidr_netmask=32
> Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
> stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
> monitor interval=10s (ClusterIP-monitor-interval-10s)
> Resource: ClusterIP2 (class=ocf provider=heartbeat type=IPaddr2)
> Attributes: ip=10.0.1.207 cidr_netmask=32
> Operations: start interval=0s timeout=20s (ClusterIP2-start-interval-0s)
> stop interval=0s timeout=20s (ClusterIP2-stop-interval-0s)
> monitor interval=10s (ClusterIP2-monitor-interval-10s)
> Resource: rtpproxycluster (class=systemd type=rtpproxy)
> Operations: monitor interval=10s (rtpproxycluster-monitor-interval-10s)
> stop interval=0s on-fail=block (rtpproxycluster-stop-interval-0s)
> Resource: kamailioetcfs (class=ocf provider=heartbeat type=Filesystem)
> Attributes: device=/dev/drbd1 directory=/etc/kamailio fstype=ext4
> Operations: start interval=0s timeout=60 (kamailioetcfs-start-interval-0s)
> monitor interval=10s on-fail=fence (kamailioetcfs-monitor-interval-10s)
> stop interval=0s on-fail=fence (kamailioetcfs-stop-interval-0s)
> Clone: fence_kam2_xvm-clone
> Meta Attrs: interleave=true clone-max=2 clone-node-max=1
> Resource: fence_kam2_xvm (class=stonith type=fence_xvm)
> Attributes: port=tegamjg_kam2 pcmk_host_list=kam2vs3
> Operations: monitor interval=60s (fence_kam2_xvm-monitor-interval-60s)
> Master: kamailioetcclone
> Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
> notify=true on-fail=fence
> Resource: kamailioetc (class=ocf provider=linbit type=drbd)
> Attributes: drbd_resource=kamailioetc
> Operations: start interval=0s timeout=240 (kamailioetc-start-interval-0s)
> promote interval=0s on-fail=fence (kamailioetc-promote-interval-0s)
> demote interval=0s on-fail=fence (kamailioetc-demote-interval-0s)
> stop interval=0s on-fail=fence (kamailioetc-stop-interval-0s)
> monitor interval=10s (kamailioetc-monitor-interval-10s)
> Clone: fence_kam1_xvm-clone
> Meta Attrs: interleave=true clone-max=2 clone-node-max=1
> Resource: fence_kam1_xvm (class=stonith type=fence_xvm)
> Attributes: port=tegamjg_kam1 pcmk_host_list=kam1vs3
> Operations: monitor interval=60s (fence_kam1_xvm-monitor-interval-60s)
> Resource: kamailiocluster (class=ocf provider=heartbeat type=kamailio)
> Attributes: listen_address=10.0.1.206 conffile=/etc/kamailio/kamailio.cfg
> pidfile=/var/run/kamailio.pid monitoring_ip=10.0.1.206 monitoring_ip2=10.0.1.207
> port=5060 proto=udp kamctlrc=/etc/kamailio/kamctlrc shmem=128 pkg=8
> Meta Attrs: target-role=Stopped
> Operations: start interval=0s timeout=60 (kamailiocluster-start-interval-0s)
> stop interval=0s timeout=30 (kamailiocluster-stop-interval-0s)
> monitor interval=5s (kamailiocluster-monitor-interval-5s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
> start fence_kam1_xvm-clone then start fence_kam2_xvm-clone (kind:Mandatory)
> (id:order-fence_kam1_xvm-clone-fence_kam2_xvm-clone-mandatory)
> start fence_kam2_xvm-clone then promote kamailioetcclone (kind:Mandatory)
> (id:order-fence_kam2_xvm-clone-kamailioetcclone-mandatory)
> promote kamailioetcclone then start kamailioetcfs (kind:Optional)
> (id:order-kamailioetcclone-kamailioetcfs-Optional)
> Resource Sets:
> set kamailioetcfs sequential=true (id:pcs_rsc_set_kamailioetcfs)
> set ClusterIP ClusterIP2 sequential=false (id:pcs_rsc_set_ClusterIP_ClusterIP2)
> set rtpproxycluster kamailiocluster sequential=true
> (id:pcs_rsc_set_rtpproxycluster_kamailiocluster)
> (id:pcs_rsc_order_set_kamailioetcfs_set_ClusterIP_ClusterIP2_set_rtpproxycluster_kamailiocluster)
> Colocation Constraints:
> rtpproxycluster with ClusterIP2 (score:INFINITY)
> (id:colocation-rtpproxycluster-ClusterIP2-INFINITY)
> ClusterIP2 with ClusterIP (score:INFINITY)
> (id:colocation-ClusterIP2-ClusterIP-INFINITY)
> ClusterIP with kamailioetcfs (score:INFINITY)
> (id:colocation-ClusterIP-kamailioetcfs-INFINITY)
> kamailioetcfs with kamailioetcclone (score:INFINITY) (with-rsc-role:Master)
> (id:colocation-kamailioetcfs-kamailioetcclone-INFINITY)
> fence_kam2_xvm-clone with fence_kam1_xvm-clone (score:INFINITY)
> (id:colocation-fence_kam2_xvm-clone-fence_kam1_xvm-clone-INFINITY)
> kamailioetcclone with fence_kam2_xvm-clone (score:INFINITY)
> (id:colocation-kamailioetcclone-fence_kam2_xvm-clone-INFINITY)
> kamailiocluster with rtpproxycluster (score:INFINITY)
> (id:colocation-kamailiocluster-rtpproxycluster-INFINITY)
>
> Resources Defaults:
> migration-threshold: 2
> failure-timeout: 10m
> resource-stickiness: 200
> Operations Defaults:
> No defaults set
>
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: kamcluster
> dc-version: 1.1.13-10.el7_2.2-44eb2dd
> have-watchdog: false
> last-lrm-refresh: 1471479940
> no-quorum-policy: ignore
> start-failure-is-fatal: false
> stonith-action: reboot
> stonith-enabled: true
Your pacemaker version is kind of old.
Somewhere in the back of my mind I think I remember seeing similar
delays (with a subsequent monitor taking the resource out of the bad
situation) while investigating
https://github.com/ClusterLabs/pacemaker/commit/5e804ec3eab232d96b2ede264bb9286294ae33f4
Maybe you could give it a try with something newer, or at least with
that commit applied.
It just adds one additional check and does the systemd reload every
time it is requested, instead of only every 10th time.
Anyway - this might be related to the systemd reloads that are needed
when the override files are written/deleted.
Either systemd is waiting for some other reason before it picks up
the changes, or, the other way round, the reload itself just takes
that long...
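If you want to see whether the reload itself is what eats the time, a
rough check directly on the node might tell you (the unit name and the
override path below are assumptions on my side, adjust them to your setup):

  # How long does a plain daemon-reload take on that box by itself?
  time systemctl daemon-reload

  # Pacemaker writes a small override snippet for systemd-class resources
  # and triggers a reload around start/stop; I would expect the snippet
  # somewhere like this (the exact path depends on the pacemaker version):
  ls -l /run/systemd/system/rtpproxy.service.d/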
>
> Now my problem is that when the rtpproxy systemd resource starts/stops, I
> am seeing a 2-second delay between the moment the start/stop request is
> sent to systemd and the moment the result is confirmed and reported back
> to the crmd. I am pretty sure that the service itself doesn't take even
> half of that time to start/stop. I did the same thing with a dummy
> service as a systemd resource and it also takes 2 seconds to start/stop
> the resource. Also, this is not happening with any of the OCF resources.
> Here is what I see in the logs about this behaviour:
>
> This is after doing "pcs resource restart rtpproxycluster"
>
> Aug 17 20:59:18 [13187] kam1 crmd: notice: te_rsc_command:
> Initiating action 14: stop rtpproxycluster_stop_0 on kam1vs3 (local)
> Aug 17 20:59:18 [13184] kam1 lrmd: info:
> cancel_recurring_action: Cancelling systemd operation
> rtpproxycluster_status_10000
> Aug 17 20:59:18 [13187] kam1 crmd: info: do_lrm_rsc_op:
> Performing key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
> op=rtpproxycluster_stop_0
> Aug 17 20:59:18 [13184] kam1 lrmd: info: log_execute:
> executing - rsc:rtpproxycluster action:stop call_id:106
> Aug 17 20:59:18 [13187] kam1 crmd: info: process_lrm_event:
> Operation rtpproxycluster_monitor_10000: Cancelled (node=kam1vs3,
> call=104, confirmed=true)
> Aug 17 20:59:18 [13184] kam1 lrmd: info:
> systemd_exec_result: Call to stop passed:
> /org/freedesktop/systemd1/job/8302
> Aug 17 20:59:20 [13187] kam1 crmd: notice: process_lrm_event:
> Operation rtpproxycluster_stop_0: ok (node=kam1vs3, call=106, rc=0,
> cib-update=134, confirmed=true)
> Aug 17 20:59:20 [13182] kam1 cib: info: cib_perform_op:
> +
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
> @operation_key=rtpproxycluster_stop_0, @operation=stop,
> @transition-key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> @transition-magic=0:0;14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> @call-id=106, @last-run=1471481958, @last-rc-change=1471481958,
> @exec-time=2116
>
> Aug 17 20:59:20 [13186] kam1 pengine: info: RecurringOp: Start
> recurring monitor (10s) for rtpproxycluster on kam1vs3
> Aug 17 20:59:20 [13186] kam1 pengine: notice: LogActions: Start
> rtpproxycluster (kam1vs3)
> Aug 17 20:59:20 [13187] kam1 crmd: notice: te_rsc_command:
> Initiating action 13: start rtpproxycluster_start_0 on kam1vs3 (local)
> Aug 17 20:59:20 [13187] kam1 crmd: info: do_lrm_rsc_op:
> Performing key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
> op=rtpproxycluster_start_0
> Aug 17 20:59:20 [13184] kam1 lrmd: info: log_execute:
> executing - rsc:rtpproxycluster action:start call_id:107
> Aug 17 20:59:21 [13184] kam1 lrmd: info:
> systemd_exec_result: Call to start passed:
> /org/freedesktop/systemd1/job/8303
> Aug 17 20:59:23 [13187] kam1 crmd: notice: process_lrm_event:
> Operation rtpproxycluster_start_0: ok (node=kam1vs3, call=107, rc=0,
> cib-update=136, confirmed=true)
> Aug 17 20:59:23 [13182] kam1 cib: info: cib_perform_op:
> +
> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
> @operation_key=rtpproxycluster_start_0, @operation=start,
> @transition-key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> @transition-magic=0:0;13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
> @call-id=107, @last-run=1471481960, @last-rc-change=1471481960,
> @exec-time=2068
>
> Why is this happening? Many of my resources depend on each other, so
> the total delay during a failover is quite large (4 seconds of delay
> just for stopping/starting that one systemd resource).
>
> Is this normal? Is there any way to solve this? Do I really need to
> write an OCF resource agent for that service (that would take some time
> that I don't really have at the moment)?
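To narrow down where the two seconds go, you could also take the resource
out of pacemaker's control for a moment and time the unit directly, then
compare that with the exec-time pacemaker recorded (I am assuming the unit
is called rtpproxy.service):

  pcs resource unmanage rtpproxycluster

  # raw start/stop, to compare against the ~2100 ms exec-time in the lrmd logs
  time systemctl stop rtpproxy.service
  time systemctl start rtpproxy.service

  # journal timestamps show when systemd considered the jobs finished
  journalctl -u rtpproxy.service -n 20 --no-pager

  pcs resource manage rtpproxycluster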
>
> Regards
>
> Alejandro
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org