[Pacemaker] long time to start
Schaefer, Diane E
diane.schaefer at unisys.com
Wed Apr 21 11:07:40 EDT 2010
>> Hi,
>> I have a resource that sometimes can take 10 minutes to start after
>> a failure due to log records that need to be sync'd. (my own OCF)
>>
>> I noticed while the start action was being performed, if other
>>
>> resources in my cluster report a "not running", no restart will be
>> attempted until my long running started resource returns.
>>
>> Meanwhile, the crm_mon reports the resources as "started"
>> eventhough they are not running, and may not be for many minutes.
>> Is the lrm process single threaded?
>You are saying that while your RA starts (with a long start timeout),
>and the start action is not yet complete,
>other _independend_ resources are not yet started,
>but crm_mon thinks they are running already,
>even though "something" (what?) reports "not running" for those?
Yes, I am saying that if a resource (R1) is taking a long time to start and another resource (R2) monitor action returns a not running, it will not be restarted until the first stuck resource returns or in my case times out. Since the stop action has not been run on R2, crm_mon still says "Started"
>I think you lost me ;)
>please show a "crm configure show"
primitive dummy-1 ocf:heartbeat:Dummy \
op monitor interval="30s" \
op start interval="0" timeout="90s" migration-threshold="0"
primitive dummy-main ocf:heartbeat:Dummy \
op monitor interval="30s" \
op start interval="0" timeout="30s" \
meta migration-threshold="0" target-role="Started"
primitive dummy-sleep ocf:heartbeat:DummySleep \
op monitor interval="60s" \
op start interval="0" timeout="2m" \
meta migration-threshold="0" target-role="Started"
colocation d inf: dummy-sleep dummy-main
colocation d1 inf: dummy-1 dummy-main
property $id="cib-bootstrap-options" \
dc-version="1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
last-lrm-refresh="1271853339
>Can you reproduce this easily?
Not easily, but I finally have the correct combination. In my case I have dependent resources, but I was able to reproduce part of the issue using the Dummy resource.
>Can you reproduce this with just a few "Dummy" resources?
I added an ocf_log to the monitor action so I could tail the messages file to see what was happening. I created another resource "DummySleep" where I inserted a sleep as follows:
dummy_start() {
ocf_log info "OCF_RESKEY_state is ${OCF_RESKEY_state}"
dummy_monitor
ret=$?
ocf_log info "dummy start sleep..."
sleep 3000
return $OCF_ERR_GENERIC
ocf_log info "dummy start sleep return..."
if [ $ret = $OCF_SUCCESS ]; then
return $OCF_SUCCESS
fi
touch ${OCF_RESKEY_state}
I ran the test as follows:
1) Commented out the sleep and return to get the DummySleep resource going with the others
2) Replaced the DummySleep OCF with sleep turned on
3) Ran crm resource stop dummy-sleep
4) Crm resource start dummy-sleep to cause it to sleep
5) "rm" the state file for dummy-main to cause the failure
You'll notice that the monitor for dummy-main keeps going even though it's not running until the dummy-sleep resource is woken up.
Apr 21 10:06:56 qpr1 lrmd: [30826]: info: RA output: (dummy-sleep:start:stderr) 2010/04/21_10:06:56 INFO: dummy start sleep...
Apr 21 10:07:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:07:01 INFO: dummy monitor
Apr 21 10:07:26 qpr1 lrmd: [30826]: info: RA output: (dummy-1:monitor:stderr) 2010/04/21_10:07:26 INFO: dummy monitor
Apr 21 10:07:31 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:07:31 INFO: dummy monitor
Apr 21 10:07:31 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:07:31 INFO: Not Running
Apr 21 10:07:31 qpr1 crmd: [30829]: info: process_lrm_event: LRM operation dummy-main_monitor_30000 (call=135, rc=7, cib-update=204, confirmed=false) not running
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_ha_callback: Update relayed from qpr2
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_local_callback: Expanded fail-count-dummy-main=value++ to 7
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-dummy-main (7)
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_perform_update: Sent update 107: fail-count-dummy-main=7
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_ha_callback: Update relayed from qpr2
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-dummy-main (1271858866)
Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_perform_update: Sent update 109: last-failure-dummy-main=1271858866
Apr 21 10:07:56 qpr1 lrmd: [30826]: info: RA output: (dummy-1:monitor:stderr) 2010/04/21_10:07:56 INFO: dummy monitor
Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: dummy monitor
Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: Not Running
Apr 21 10:08:26 qpr1 lrmd: [30826]: info: RA output: (dummy-1:monitor:stderr) 2010/04/21_10:08:26 INFO: dummy monitor
Apr 21 10:08:31 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:31 INFO: dummy monitor
Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: Not Running
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100421/c90d9105/attachment-0001.html>
More information about the Pacemaker
mailing list