[ClusterLabs] pacemaker systemd resource

Andrei Borzenkov arvidjaar at gmail.com
Wed Jul 22 13:57:27 EDT 2020


22.07.2020 12:46, Хиль Эдуард пишет:
> 
> Hey, Andrei! Thanx for ur time!
> A-a-and there is no chance to do something? :( 
> The pacemaker’s log below.
>  

Resource was started:

...
> Jul 22 12:38:36 node2.local pacemaker-execd     [1721] (log_execute)     info: executing - rsc:dummy.service action:start call_id:76
> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.131.4 2
> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.131.5 (null)
> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @num_updates=5
> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']:  @operation_key=dummy.service_start_0, @operation=start, @transition-key=164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, @transition-magic=-1:193;164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1595410716, @last-run=1595410716, @e
> Jul 22 12:38:36 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=node2.local/crmd/62, version=0.131.5)
> Jul 22 12:38:36 node2.local pacemaker-execd     [1721] (systemd_exec_result)     info: Call to start passed: /org/freedesktop/systemd1/job/703
> Jul 22 12:38:38 node2.local pacemaker-controld  [1724] (process_lrm_event)     notice: Result of start operation for dummy.service on node2.local: 0 (ok) | call=76 key=dummy.service_start_0 confirmed=true cib-update=63

So start operation at least was successfully completed.

> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)     info: Forwarding cib_modify operation for section status to all (origin=local/crmd/63)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.131.5 2
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.131.6 (null)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @num_updates=6
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']:  @transition-magic=0:0;164:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, @call-id=76, @rc-code=0, @op-status=0, @last-rc-change=1986, @last-run=1986, @exec-time=-587720, @queue-time=59
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=node2.local/crmd/63, version=0.131.6)
> Jul 22 12:38:38 node2.local pacemaker-controld  [1724] (do_lrm_rsc_op)     info: Performing key=165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a op=dummy.service_monitor_60000
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)     info: Forwarding cib_modify operation for section status to all (origin=local/crmd/64)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.131.6 2
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.131.7 (null)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @num_updates=7
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']:  <lrm_rsc_op id="dummy.service_monitor_60000" operation_key="dummy.service_monitor_60000" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.2.0" transition-key="165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a" transition-magic="-1:193;165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a" exit-reason="" on_
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=node2.local/crmd/64, version=0.131.7)
> Jul 22 12:38:38 node2.local pacemaker-controld  [1724] (process_lrm_event)     notice: Result of monitor operation for dummy.service on node2.local: 0 (ok) | call=77 key=dummy.service_monitor_60000 confirmed=false cib-update=65

And monitor confirmed that it was started

> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)     info: Forwarding cib_modify operation for section status to all (origin=local/crmd/65)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.131.7 2
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.131.8 (null)
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @num_updates=8
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_monitor_60000']:  @transition-magic=0:0;165:23:0:76f4932e-716b-45b8-8fed-a20c3806df8a, @call-id=77, @rc-code=0, @op-status=0, @last-rc-change=1989, @exec-time=1542, @queue-time=63
> Jul 22 12:38:38 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=node2.local/crmd/65, version=0.131.8)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_process_request)     info: Forwarding cib_apply_diff operation for section 'all' to all (origin=local/cibadmin/2)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.131.8 2
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.132.0 (null)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @epoch=132, @num_updates=0
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: ++ /cib/configuration/resources/primitive[@id='dummy.service']:  <meta_attributes id="dummy.service-meta_attributes"/>
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: ++                                                                  <nvpair id="dummy.service-meta_attributes-target-role" name="target-role" value="Stopped"/>

Something requested stop of this resource. Logs from DC at this point
may provide more clues why it happened.

> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: ++                                                                </meta_attributes>
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_apply_diff operation for section 'all': OK (rc=0, origin=node2.local/cibadmin/2, version=0.132.0)
> Jul 22 12:38:42 node2.local pacemaker-fenced    [1720] (update_cib_stonith_devices_v2)     info: Updating device list from the cib: create primitive[@id='dummy.service']
> Jul 22 12:38:42 node2.local pacemaker-fenced    [1720] (cib_devices_update)     info: Updating devices to version 0.132.0
> Jul 22 12:38:42 node2.local pacemaker-fenced    [1720] (cib_device_update)     info: Device mfs4.stonith has been disabled on node2.local: score=-INFINITY
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_file_backup)     info: Archived previous version as /var/lib/pacemaker/cib/cib-21.raw
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_file_write_with_digest)     info: Wrote version 0.132.0 of the CIB to disk (digest: 408cdd7e93bae2d8e2899d800ab82c61)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_file_write_with_digest)     info: Reading cluster configuration file /var/lib/pacemaker/cib/cib.oVovE6 (digest: /var/lib/pacemaker/cib/cib.SiUE36)
> Jul 22 12:38:42 node2.local pacemaker-execd     [1721] (cancel_recurring_action)     info: Cancelling systemd operation dummy.service_status_60000
> Jul 22 12:38:42 node2.local pacemaker-controld  [1724] (do_lrm_rsc_op)     info: Performing key=163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a op=dummy.service_stop_0
> Jul 22 12:38:42 node2.local pacemaker-execd     [1721] (log_execute)     info: executing - rsc:dummy.service action:stop call_id:79
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_process_request)     info: Forwarding cib_modify operation for section status to all (origin=local/crmd/66)
> Jul 22 12:38:42 node2.local pacemaker-controld  [1724] (process_lrm_event)     info: Result of monitor operation for dummy.service on node2.local: Cancelled | call=77 key=dummy.service_monitor_60000 confirmed=true
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.132.0 2
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.132.1 (null)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @num_updates=1
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']:  @operation_key=dummy.service_stop_0, @operation=stop, @transition-key=163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a, @transition-magic=-1:193;163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1595410722, @last-run=1595410722, @exe
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=node2.local/crmd/66, version=0.132.1)
> Jul 22 12:38:42 node2.local pacemaker-execd     [1721] (systemd_exec_result)     info: Call to stop passed: /org/freedesktop/systemd1/job/798
> Jul 22 12:38:42 node2.local pacemaker-execd     [1721] (action_complete)     notice: Giving up on dummy.service stop (rc=0): timeout (elapsed=299980ms, remaining=-199980ms)

Well, times are absolutely nonsensical. Operation was requested less
than a second ago.

That sounds like the same problem as reported in

https://lists.clusterlabs.org/pipermail/users/2020-June/027212.html

You need to investigate what tries to stop you resource immediately
after start.

> Jul 22 12:38:42 node2.local pacemaker-controld  [1724] (process_lrm_event)     error: Result of stop operation for dummy.service on node2.local: Timed Out | call=79 key=dummy.service_stop_0 timeout=100000ms
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_process_request)     info: Forwarding cib_modify operation for section status to all (origin=local/crmd/67)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.132.1 2
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.132.2 (null)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @num_updates=2
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']/lrm_rsc_op[@id='dummy.service_last_0']:  @transition-magic=2:198;163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a, @call-id=79, @rc-code=198, @op-status=2, @last-rc-change=1992, @last-run=1992, @exec-time=300031, @queue-time=57
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: ++ /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='dummy.service']:  <lrm_rsc_op id="dummy.service_last_failure_0" operation_key="dummy.service_stop_0" operation="stop" crm-debug-origin="do_update_resource" crm_feature_set="3.2.0" transition-key="163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a" transition-magic="2:198;163:24:0:76f4932e-716b-45b8-8fed-a20c3806df8a" exit-reason="" on_node="mfs4
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=node2.local/crmd/67, version=0.132.2)
> Jul 22 12:38:42 node2.local pacemaker-attrd     [1722] (attrd_peer_update)     notice: Setting fail-count-dummy.service#stop_0[node2.local]: (unset) -> INFINITY | from node1.local
> Jul 22 12:38:42 node2.local pacemaker-attrd     [1722] (attrd_peer_update)     notice: Setting last-failure-dummy.service#stop_0[node2.local]: (unset) -> 1595410722 | from node1.local
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.132.2 2
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.132.3 (null)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @num_updates=3
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']:  <nvpair id="status-2-fail-count-dummy.service.stop_0" name="fail-count-dummy.service#stop_0" value="INFINITY"/>
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=node1.local/attrd/34, version=0.132.3)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.132.3 2
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.132.4 (null)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @num_updates=4
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: ++ /cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']:  <nvpair id="status-2-last-failure-dummy.service.stop_0" name="last-failure-dummy.service#stop_0" value="1595410722"/>
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_modify operation for section status: OK (rc=0, origin=node1.local/attrd/35, version=0.132.4)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.132.4 2
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.132.5 (null)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: -- /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='drbd_docker1']/lrm_rsc_op[@id='drbd_docker1_monitor_60000']
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @num_updates=5
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_delete operation for section status: OK (rc=0, origin=node1.local/crmd/246, version=0.132.5)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: --- 0.132.5 2
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: Diff: +++ 0.132.6 (null)
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: -- /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='drbd_docker2']/lrm_rsc_op[@id='drbd_docker2_monitor_60000']
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_perform_op)     info: +  /cib:  @num_updates=6
> Jul 22 12:38:42 node2.local pacemaker-based     [1719] (cib_process_request)     info: Completed cib_delete operation for section status: OK (rc=0, origin=node1.local/crmd/247, version=0.132.6)
> Jul 22 12:38:47 node2.local pacemaker-based     [1719] (cib_process_ping)     info: Reporting our current digest to node1.local: 6406f330dbe96e8c3e732c29f91a5865 for 0.132.6 (0x55a3b5b7a640 0)
>   
>> Среда, 22 июля 2020, 14:17 +05:00 от Andrei Borzenkov <arvidjaar at gmail.com>:
>>  
>>    
>> On Wed, Jul 22, 2020 at 10:59 AM Хиль Эдуард < i.am.test at mail.ru > wrote:
>>> Hi there! I have 2 nodes with Pacemaker 2.0.3, corosync 3.0.3 on ubuntu 20 + 1 qdevice. I want to define new resource as systemd unit  dummy.service  :
>>>  
>>> [Unit]
>>> Description=Dummy
>>> [Service]
>>> Restart=on-failure
>>> StartLimitInterval=20
>>> StartLimitBurst=5
>>> TimeoutStartSec=0
>>> RestartSec=5
>>> Environment="HOME=/root"
>>> SyslogIdentifier=dummy
>>> ExecStart=/usr/local/sbin/dummy.sh
>>> [Install]
>>> WantedBy=multi-user.target
>>>  
>>> and /usr/local/sbin/dummy.sh :
>>>  
>>> #!/bin/bash
>>> CNT=0
>>> while true; do
>>>   let CNT++
>>>   echo "hello world $CNT"
>>>   sleep 5
>>> done
>>>  
>>> and then i try to define it with: pcs resource create dummy.service systemd:dummy op monitor interval="10s" timeout="15s"
>>> after 2 seconds node2 reboot.
>>  
>> Node reboots because stop operation failed, no start.
>>  
>>> In logs i see pacemaker in 2 seconds tried to start this unit, and it started, but pacemaker somehow think he is «Timed Out» . What i am doing wrong? Logs below.
>>>  
>>>  
>>> Jul 21 15:53:41 node2.local pacemaker-controld[1813]:  notice: Result of probe operation for dummy.service on node2.local: 7 (not running) 
>>> Jul 21 15:53:41 node2.local systemd[1]: Reloading.
>>> Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please update the unit file accordingly.
>>> Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/docker.socket:6: ListenStream= references a path below legacy directory /var/run/, updating /var/run/docker.sock → /run/docker.sock; please update the unit file accordingly.
>>> Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up on dummy.service start (rc=0): timeout (elapsed=259719ms, remaining=-159719ms)
>>> Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result of start operation for dummy.service on node2.local: Timed Out 
>>> Jul 21 15:53:42 node2.local systemd[1]: Started Cluster Controlled dummy.
>>> Jul 21 15:53:42 node2.local dummy[9330]: hello world 1
>>> Jul 21 15:53:42 node2.local systemd-udevd[922]: Network interface NamePolicy= disabled on kernel command line, ignoring.
>>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting fail-count-dummy.service#start_0[node2.local]: (unset) -> INFINITY 
>>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting last-failure-dummy.service#start_0[node2.local]: (unset) -> 1595336022 
>>> Jul 21 15:53:42 node2.local systemd[1]: Reloading.
>>> Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please update the unit file accordingly.
>>> Jul 21 15:53:42 node2.local systemd[1]: /lib/systemd/system/docker.socket:6: ListenStream= references a path below legacy directory /var/run/, updating /var/run/docker.sock → /run/docker.sock; please update the unit file accordingly.
>>> Jul 21 15:53:42 node2.local pacemaker-execd[1808]:  notice: Giving up on dummy.service stop (rc=0): timeout (elapsed=317181ms, remaining=-217181ms)
>>  
>> 317181ms == 5 minutes. Barring pacemaker bug, you need to show pacemaker log since the very first start operation so we can see proper timing. Seeing that systemd was reloaded in between, it is quite possible that systemd lost track of pending job so any client waiting for confirmation hangs forever. Such problems were known, not sure what current status is (if it ever was fixed).
>>  
>>> Jul 21 15:53:42 node2.local pacemaker-controld[1813]:  error: Result of stop operation for dummy.service on node2.local: Timed Out 
>>> Jul 21 15:53:42 node2.local systemd[1]: Stopping Daemon for dummy...
>>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting fail-count-dummy.service#stop_0[node2.local]: (unset) -> INFINITY 
>>> Jul 21 15:53:42 node2.local pacemaker-attrd[1809]:  notice: Setting last-failure-dummy.service#stop_0[node2.local]: (unset) -> 1595336022 
>>> Jul 21 15:53:42 node2.local systemd[1]: dummy.service: Succeeded.
>>> Jul 21 15:53:42 node2.local systemd[1]: Stopped Daemon for dummy.
>>> ... lost connection (node rebooting)
>>>  
>>>   _______________________________________________
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home:  https://www.clusterlabs.org/ 
>  
>  
> --
> Хиль Эдуард
>  
> 



More information about the Users mailing list