[ClusterLabs] systemd RA start/stop delays
Ken Gaillot
kgaillot at redhat.com
Wed Aug 31 16:58:14 UTC 2016
On 08/30/2016 05:18 AM, Dejan Muhamedagic wrote:
> Hi,
>
> On Thu, Aug 18, 2016 at 09:00:24AM -0500, Ken Gaillot wrote:
>> On 08/17/2016 08:17 PM, TEG AMJG wrote:
>>> Hi
>>>
>>> I am having a problem with a simple Active/Passive cluster, which
>>> consists of the following configuration:
>>>
>>> Cluster Name: kamcluster
>>> Corosync Nodes:
>>> kam1vs3 kam2vs3
>>> Pacemaker Nodes:
>>> kam1vs3 kam2vs3
>>>
>>> Resources:
>>> Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>> Attributes: ip=10.0.1.206 cidr_netmask=32
>>> Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>> stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>> monitor interval=10s (ClusterIP-monitor-interval-10s)
>>> Resource: ClusterIP2 (class=ocf provider=heartbeat type=IPaddr2)
>>> Attributes: ip=10.0.1.207 cidr_netmask=32
>>> Operations: start interval=0s timeout=20s (ClusterIP2-start-interval-0s)
>>> stop interval=0s timeout=20s (ClusterIP2-stop-interval-0s)
>>> monitor interval=10s (ClusterIP2-monitor-interval-10s)
>>> Resource: rtpproxycluster (class=systemd type=rtpproxy)
>>> Operations: monitor interval=10s (rtpproxycluster-monitor-interval-10s)
>>> stop interval=0s on-fail=block (rtpproxycluster-stop-interval-0s)
>>> Resource: kamailioetcfs (class=ocf provider=heartbeat type=Filesystem)
>>> Attributes: device=/dev/drbd1 directory=/etc/kamailio fstype=ext4
>>> Operations: start interval=0s timeout=60 (kamailioetcfs-start-interval-0s)
>>> monitor interval=10s on-fail=fence (kamailioetcfs-monitor-interval-10s)
>>> stop interval=0s on-fail=fence (kamailioetcfs-stop-interval-0s)
>>> Clone: fence_kam2_xvm-clone
>>> Meta Attrs: interleave=true clone-max=2 clone-node-max=1
>>> Resource: fence_kam2_xvm (class=stonith type=fence_xvm)
>>> Attributes: port=tegamjg_kam2 pcmk_host_list=kam2vs3
>>> Operations: monitor interval=60s (fence_kam2_xvm-monitor-interval-60s)
>>> Master: kamailioetcclone
>>> Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true on-fail=fence
>>> Resource: kamailioetc (class=ocf provider=linbit type=drbd)
>>> Attributes: drbd_resource=kamailioetc
>>> Operations: start interval=0s timeout=240 (kamailioetc-start-interval-0s)
>>> promote interval=0s on-fail=fence (kamailioetc-promote-interval-0s)
>>> demote interval=0s on-fail=fence (kamailioetc-demote-interval-0s)
>>> stop interval=0s on-fail=fence (kamailioetc-stop-interval-0s)
>>> monitor interval=10s (kamailioetc-monitor-interval-10s)
>>> Clone: fence_kam1_xvm-clone
>>> Meta Attrs: interleave=true clone-max=2 clone-node-max=1
>>> Resource: fence_kam1_xvm (class=stonith type=fence_xvm)
>>> Attributes: port=tegamjg_kam1 pcmk_host_list=kam1vs3
>>> Operations: monitor interval=60s (fence_kam1_xvm-monitor-interval-60s)
>>> Resource: kamailiocluster (class=ocf provider=heartbeat type=kamailio)
>>> Attributes: listen_address=10.0.1.206 conffile=/etc/kamailio/kamailio.cfg pidfile=/var/run/kamailio.pid monitoring_ip=10.0.1.206 monitoring_ip2=10.0.1.207 port=5060 proto=udp kamctlrc=/etc/kamailio/kamctlrc shmem=128 pkg=8
>>> Meta Attrs: target-role=Stopped
>>> Operations: start interval=0s timeout=60 (kamailiocluster-start-interval-0s)
>>> stop interval=0s timeout=30 (kamailiocluster-stop-interval-0s)
>>> monitor interval=5s (kamailiocluster-monitor-interval-5s)
>>>
>>> Stonith Devices:
>>> Fencing Levels:
>>>
>>> Location Constraints:
>>> Ordering Constraints:
>>> start fence_kam1_xvm-clone then start fence_kam2_xvm-clone (kind:Mandatory) (id:order-fence_kam1_xvm-clone-fence_kam2_xvm-clone-mandatory)
>>> start fence_kam2_xvm-clone then promote kamailioetcclone (kind:Mandatory) (id:order-fence_kam2_xvm-clone-kamailioetcclone-mandatory)
>>> promote kamailioetcclone then start kamailioetcfs (kind:Optional) (id:order-kamailioetcclone-kamailioetcfs-Optional)
>>> Resource Sets:
>>> set kamailioetcfs sequential=true (id:pcs_rsc_set_kamailioetcfs)
>>> set ClusterIP ClusterIP2 sequential=false (id:pcs_rsc_set_ClusterIP_ClusterIP2)
>>> set rtpproxycluster kamailiocluster sequential=true (id:pcs_rsc_set_rtpproxycluster_kamailiocluster)
>>> (id:pcs_rsc_order_set_kamailioetcfs_set_ClusterIP_ClusterIP2_set_rtpproxycluster_kamailiocluster)
>>> Colocation Constraints:
>>> rtpproxycluster with ClusterIP2 (score:INFINITY) (id:colocation-rtpproxycluster-ClusterIP2-INFINITY)
>>> ClusterIP2 with ClusterIP (score:INFINITY) (id:colocation-ClusterIP2-ClusterIP-INFINITY)
>>> ClusterIP with kamailioetcfs (score:INFINITY) (id:colocation-ClusterIP-kamailioetcfs-INFINITY)
>>> kamailioetcfs with kamailioetcclone (score:INFINITY) (with-rsc-role:Master) (id:colocation-kamailioetcfs-kamailioetcclone-INFINITY)
>>> fence_kam2_xvm-clone with fence_kam1_xvm-clone (score:INFINITY) (id:colocation-fence_kam2_xvm-clone-fence_kam1_xvm-clone-INFINITY)
>>> kamailioetcclone with fence_kam2_xvm-clone (score:INFINITY) (id:colocation-kamailioetcclone-fence_kam2_xvm-clone-INFINITY)
>>> kamailiocluster with rtpproxycluster (score:INFINITY) (id:colocation-kamailiocluster-rtpproxycluster-INFINITY)
>>>
>>> Resources Defaults:
>>> migration-threshold: 2
>>> failure-timeout: 10m
>>> resource-stickiness: 200
>>> Operations Defaults:
>>> No defaults set
>>>
>>> Cluster Properties:
>>> cluster-infrastructure: corosync
>>> cluster-name: kamcluster
>>> dc-version: 1.1.13-10.el7_2.2-44eb2dd
>>> have-watchdog: false
>>> last-lrm-refresh: 1471479940
>>> no-quorum-policy: ignore
>>> start-failure-is-fatal: false
>>> stonith-action: reboot
>>> stonith-enabled: true
>>>
>>> Now my problem is that when the rtpproxy systemd resource starts/stops,
>>> I am experiencing a 2-second delay between the moment the start/stop
>>> request is sent to systemd and the confirmation and interpretation of
>>> the action by crmd. I am pretty sure that starting this service doesn't
>>
>> The problem is that systemd does not have a way to say "start this
>> service, and don't tell me it's done until it's completely started".
>> Systemd reports successful completion as soon as the start has been
>> initiated.
>
> The same holds for LSB init scripts. However, pacemaker didn't
> make any effort to check whether the service (typically some kind
> of daemon) was actually started/stopped.
>
>> By contrast, Pacemaker can't proceed with dependencies until the service
>> is completely up. So, Pacemaker's workaround is to initiate the start,
>> then poll the status every 2 seconds until systemd reports the service
>> is completely up.
>>
>> The 2 seconds is hardcoded, so there's no way to avoid that at the
>> moment, other than using an OCF agent.
>
> So, is this only for systemd resources or also LSB? Do you know
> when the 2 second delay got introduced?
Only systemd and nagios. It was introduced in commit c78b4b8, released
with 1.1.12.
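
To make the workaround described above concrete, here is a rough,
standalone sketch of the poll-until-active approach, written against the
public sd-bus API rather than taken from the Pacemaker sources (the unit
name rtpproxy.service, the helper wait_until_active() and the overall
structure are purely illustrative). StartUnit returns success as soon as
a start job has been queued (for a Type=simple unit that is essentially
immediate), so the caller has to keep re-reading the unit's ActiveState
property, which is where the fixed 2-second interval comes in:

/* poll.c: a rough illustration of the poll-until-active workaround,
 * written against the public sd-bus API (this is NOT Pacemaker's
 * actual code; the unit name and structure are illustrative).
 * Build: gcc -o poll poll.c $(pkg-config --cflags --libs libsystemd) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <systemd/sd-bus.h>

/* Re-read the unit's ActiveState property every 2 seconds until the
 * unit is really up (or has failed). */
static int wait_until_active(sd_bus *bus, const char *unit_path)
{
    for (;;) {
        sd_bus_error error = SD_BUS_ERROR_NULL;
        char *state = NULL;

        int r = sd_bus_get_property_string(bus,
                "org.freedesktop.systemd1", unit_path,
                "org.freedesktop.systemd1.Unit", "ActiveState",
                &error, &state);
        if (r < 0) {
            fprintf(stderr, "query failed: %s\n", error.message);
            sd_bus_error_free(&error);
            return r;
        }

        int active = strcmp(state, "active") == 0;
        int failed = strcmp(state, "failed") == 0;
        free(state);

        if (active)
            return 0;
        if (failed)
            return -1;

        sleep(2); /* the hardcoded 2-second interval discussed above */
    }
}

int main(int argc, char **argv)
{
    const char *unit = argc > 1 ? argv[1] : "rtpproxy.service";
    sd_bus *bus = NULL;
    sd_bus_error error = SD_BUS_ERROR_NULL;
    sd_bus_message *reply = NULL;
    const char *path = NULL;
    int r;

    if (sd_bus_open_system(&bus) < 0)
        return 1;

    /* StartUnit only confirms that a start *job* was queued; this is
     * the "Call to start passed: /org/freedesktop/systemd1/job/8303"
     * moment in the lrmd log, not the service actually being up. */
    r = sd_bus_call_method(bus, "org.freedesktop.systemd1",
            "/org/freedesktop/systemd1",
            "org.freedesktop.systemd1.Manager", "StartUnit",
            &error, NULL, "ss", unit, "replace");
    if (r < 0) {
        fprintf(stderr, "StartUnit: %s\n", error.message);
        return 1;
    }

    /* Look up the unit's object path so its properties can be read. */
    r = sd_bus_call_method(bus, "org.freedesktop.systemd1",
            "/org/freedesktop/systemd1",
            "org.freedesktop.systemd1.Manager", "GetUnit",
            &error, &reply, "s", unit);
    if (r < 0 || sd_bus_message_read(reply, "o", &path) < 0)
        return 1;

    r = wait_until_active(bus, path);
    printf("%s is %s\n", unit, r == 0 ? "active" : "not active");

    sd_bus_message_unref(reply);
    sd_bus_unref(bus);
    return r == 0 ? 0 : 1;
}

An OCF agent sidesteps all of this because its start action simply does
not return until the service is demonstrably up.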
Coincidentally, we're currently looking at using systemd's DBus
signaling, rather than the 2-second polling, to determine when the
action has taken effect. That's still in the early stages, but hopefully
it will be a better approach.
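
For anyone curious what the signal-driven alternative could look like,
here is one possible shape of the idea, again only a sketch against
plain sd-bus and not anything from the Pacemaker tree: subscribe to the
manager's JobRemoved signal, queue the start, and block until the job we
queued has run to completion, with no fixed-interval polling at all:

/* notify.c: a sketch of a signal-driven wait (NOT Pacemaker code; the
 * unit name and structure are illustrative).
 * Build: gcc -o notify notify.c $(pkg-config --cflags --libs libsystemd) */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <systemd/sd-bus.h>

struct waiter {
    const char *job_path; /* object path returned by StartUnit */
    int done;
    char result[32];      /* "done", "failed", "canceled", ... */
};

/* Called for every JobRemoved(u id, o job, s unit, s result) signal. */
static int on_job_removed(sd_bus_message *m, void *userdata,
                          sd_bus_error *ret_error)
{
    struct waiter *w = userdata;
    uint32_t id;
    const char *job, *unit, *result;

    if (sd_bus_message_read(m, "uoss", &id, &job, &unit, &result) < 0)
        return 0;
    if (w->job_path && strcmp(job, w->job_path) == 0) {
        snprintf(w->result, sizeof(w->result), "%s", result);
        w->done = 1;
    }
    return 0;
}

int main(int argc, char **argv)
{
    const char *unit = argc > 1 ? argv[1] : "rtpproxy.service";
    sd_bus *bus = NULL;
    sd_bus_error error = SD_BUS_ERROR_NULL;
    sd_bus_message *reply = NULL;
    struct waiter w = { NULL, 0, "" };
    int r;

    if (sd_bus_open_system(&bus) < 0)
        return 1;

    /* systemd only emits Job* signals to peers that have subscribed. */
    if (sd_bus_call_method(bus, "org.freedesktop.systemd1",
            "/org/freedesktop/systemd1",
            "org.freedesktop.systemd1.Manager", "Subscribe",
            &error, NULL, "") < 0)
        sd_bus_error_free(&error);

    /* Install the match *before* starting, so no signal is missed. */
    sd_bus_add_match(bus, NULL,
            "type='signal',sender='org.freedesktop.systemd1',"
            "path='/org/freedesktop/systemd1',"
            "interface='org.freedesktop.systemd1.Manager',"
            "member='JobRemoved'",
            on_job_removed, &w);

    r = sd_bus_call_method(bus, "org.freedesktop.systemd1",
            "/org/freedesktop/systemd1",
            "org.freedesktop.systemd1.Manager", "StartUnit",
            &error, &reply, "ss", unit, "replace");
    if (r < 0) {
        fprintf(stderr, "StartUnit: %s\n", error.message);
        return 1;
    }
    sd_bus_message_read(reply, "o", &w.job_path);

    /* Sleep on the bus until *our* start job is removed, i.e. done. */
    while (!w.done) {
        r = sd_bus_process(bus, NULL);
        if (r > 0)
            continue;
        sd_bus_wait(bus, UINT64_MAX);
    }
    printf("start job for %s finished: %s\n", unit, w.result);

    sd_bus_message_unref(reply);
    sd_bus_unref(bus);
    return strcmp(w.result, "done") == 0 ? 0 : 1;
}

The attraction is that the wake-up happens exactly when systemd finishes
the job, so the confirmation latency is bounded by DBus delivery rather
than by a polling interval.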
>
> Thanks,
>
> Dejan
>
>>> take even half of that time to start/stop. I did the same thing with a
>>> dummy service as a systemd resource, and it also takes 2 seconds to
>>> start/stop. Also, this is not happening with any of the OCF
>>> resources. Here is what I see in the logs about this behaviour:
>>>
>>> This is after doing "pcs resource restart rtpproxycluster"
>>>
>>> Aug 17 20:59:18 [13187] kam1 crmd: notice: te_rsc_command:
>>> Initiating action 14: stop rtpproxycluster_stop_0 on kam1vs3 (local)
>>> Aug 17 20:59:18 [13184] kam1 lrmd: info:
>>> cancel_recurring_action: Cancelling systemd operation
>>> rtpproxycluster_status_10000
>>> Aug 17 20:59:18 [13187] kam1 crmd: info: do_lrm_rsc_op:
>>> Performing key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
>>> op=rtpproxycluster_stop_0
>>> Aug 17 20:59:18 [13184] kam1 lrmd: info: log_execute:
>>> executing - rsc:rtpproxycluster action:stop call_id:106
>>> Aug 17 20:59:18 [13187] kam1 crmd: info: process_lrm_event:
>>> Operation rtpproxycluster_monitor_10000: Cancelled (node=kam1vs3,
>>> call=104, confirmed=true)
>>> Aug 17 20:59:18 [13184] kam1 lrmd: info: systemd_exec_result:
>>> Call to stop passed: /org/freedesktop/systemd1/job/8302
>>> Aug 17 20:59:20 [13187] kam1 crmd: notice: process_lrm_event:
>>> Operation rtpproxycluster_stop_0: ok (node=kam1vs3, call=106, rc=0,
>>> cib-update=134, confirmed=true)
>>> Aug 17 20:59:20 [13182] kam1 cib: info: cib_perform_op:
>>> +
>>> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
>>> @operation_key=rtpproxycluster_stop_0, @operation=stop,
>>> @transition-key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>>> @transition-magic=0:0;14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>>> @call-id=106, @last-run=1471481958, @last-rc-change=1471481958,
>>> @exec-time=2116
>>>
>>> Aug 17 20:59:20 [13186] kam1 pengine: info: RecurringOp: Start
>>> recurring monitor (10s) for rtpproxycluster on kam1vs3
>>> Aug 17 20:59:20 [13186] kam1 pengine: notice: LogActions: Start
>>> rtpproxycluster (kam1vs3)
>>> Aug 17 20:59:20 [13187] kam1 crmd: notice: te_rsc_command:
>>> Initiating action 13: start rtpproxycluster_start_0 on kam1vs3 (local)
>>> Aug 17 20:59:20 [13187] kam1 crmd: info: do_lrm_rsc_op:
>>> Performing key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b
>>> op=rtpproxycluster_start_0
>>> Aug 17 20:59:20 [13184] kam1 lrmd: info: log_execute:
>>> executing - rsc:rtpproxycluster action:start call_id:107
>>> Aug 17 20:59:21 [13184] kam1 lrmd: info: systemd_exec_result:
>>> Call to start passed: /org/freedesktop/systemd1/job/8303
>>> Aug 17 20:59:23 [13187] kam1 crmd: notice: process_lrm_event:
>>> Operation rtpproxycluster_start_0: ok (node=kam1vs3, call=107, rc=0,
>>> cib-update=136, confirmed=true)
>>> Aug 17 20:59:23 [13182] kam1 cib: info: cib_perform_op:
>>> +
>>> /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']:
>>> @operation_key=rtpproxycluster_start_0, @operation=start,
>>> @transition-key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>>> @transition-magic=0:0;13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b,
>>> @call-id=107, @last-run=1471481960, @last-rc-change=1471481960,
>>> @exec-time=2068
>>>
>>> Why is this happening? Many of my resources depend on each other, so
>>> the delay when a failover is needed adds up quickly (4 whole seconds of
>>> delay stopping/starting just that one systemd resource).
>>>
>>> Is this normal? Is there any way to solve this? Do I really need to
>>> write an OCF resource agent for that service (which would take time
>>> that I don't really have at the moment)?
>>>
>>> Regards
>>>
>>> Alejandro