<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<style>
<!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri","sans-serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoPlainText, li.MsoPlainText, div.MsoPlainText
{mso-style-priority:99;
mso-style-link:"Plain Text Char";
margin:0in;
margin-bottom:.0001pt;
font-size:10.5pt;
font-family:Consolas;}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:"Calibri","sans-serif";
color:windowtext;}
span.PlainTextChar
{mso-style-name:"Plain Text Char";
mso-style-priority:99;
mso-style-link:"Plain Text";
font-family:Consolas;}
.MsoChpDefault
{mso-style-type:export-only;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.0in 1.0in 1.0in;}
div.Section1
{page:Section1;}
/* List Definitions */
@list l0
{mso-list-id:1984458087;
mso-list-type:hybrid;
mso-list-template-ids:-1581583520 67698705 67698713 67698715 67698703 67698713 67698715 67698703 67698713 67698715;}
@list l0:level1
{mso-level-text:"%1\)";
mso-level-tab-stop:none;
mso-level-number-position:left;
text-indent:-.25in;}
ol
{margin-bottom:0in;}
ul
{margin-bottom:0in;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoPlainText>>> Hi,<o:p></o:p></p>
<p class=MsoPlainText>>> I have a resource that sometimes can
take 10 minutes to start after<o:p></o:p></p>
<p class=MsoPlainText>>> a failure due to log records that
need to be sync'd. (my own OCF)<o:p></o:p></p>
<p class=MsoPlainText>>><o:p> </o:p></p>
<p class=MsoPlainText>>> I noticed while the start action was
being performed, if other<o:p></o:p></p>
<p class=MsoPlainText>>> resources in my cluster report a
"not running", no restart will be<o:p></o:p></p>
<p class=MsoPlainText>>> attempted until my long running
started resource returns.<o:p></o:p></p>
<p class=MsoPlainText>>><o:p> </o:p></p>
<p class=MsoPlainText>>> Meanwhile, the crm_mon reports
the resources as "started"<o:p></o:p></p>
<p class=MsoPlainText>>> even though they are not running, and
may not be for many minutes.<o:p></o:p></p>
<p class=MsoPlainText>>> Is the lrm process single threaded?<o:p></o:p></p>
<p class=MsoPlainText><o:p> </o:p></p>
<p class=MsoPlainText>>You are saying that while your RA starts (with a long
start timeout),<o:p></o:p></p>
<p class=MsoPlainText>>and the start action is not yet complete,<o:p></o:p></p>
<p class=MsoPlainText>>other _independent_ resources are not yet started,<o:p></o:p></p>
<p class=MsoPlainText>>but crm_mon thinks they are running already,<o:p></o:p></p>
<p class=MsoPlainText>>even though "something" (what?) reports
"not running" for those?<o:p></o:p></p>
<p class=MsoPlainText><o:p> </o:p></p>
<p class=MsoPlainText>Yes. If one resource (R1) is taking a long time to start
and the monitor action of another resource (R2) returns "not running", R2 is
not restarted until R1's start action returns or, in my case, times out. Since
the stop action has not been run on R2, crm_mon still says “Started”.<o:p></o:p></p>
<p class=MsoPlainText><o:p> </o:p></p>
<p class=MsoPlainText>>I think you lost me ;)<o:p></o:p></p>
<p class=MsoPlainText><o:p> </o:p></p>
<p class=MsoPlainText>>please show a "crm configure show"<o:p></o:p></p>
<p class=MsoPlainText>primitive dummy-1 ocf:heartbeat:Dummy \<o:p></o:p></p>
<p class=MsoPlainText> op monitor
interval="30s" \<o:p></o:p></p>
<p class=MsoPlainText> op start
interval="0" timeout="90s" migration-threshold="0"<o:p></o:p></p>
<p class=MsoPlainText>primitive dummy-main ocf:heartbeat:Dummy \<o:p></o:p></p>
<p class=MsoPlainText> <span lang=NL>op
monitor interval="30s" \<o:p></o:p></span></p>
<p class=MsoPlainText><span lang=NL>
op start interval="0" timeout="30s" \<o:p></o:p></span></p>
<p class=MsoPlainText><span lang=NL> </span>meta
migration-threshold="0" target-role="Started"<o:p></o:p></p>
<p class=MsoPlainText>primitive dummy-sleep ocf:heartbeat:DummySleep \<o:p></o:p></p>
<p class=MsoPlainText> <span lang=NL>op
monitor interval="60s" \<o:p></o:p></span></p>
<p class=MsoPlainText><span lang=NL>
op start interval="0" timeout="2m" \<o:p></o:p></span></p>
<p class=MsoPlainText><span lang=NL> </span>meta
migration-threshold="0" target-role="Started"<o:p></o:p></p>
<p class=MsoPlainText>colocation d inf: dummy-sleep dummy-main<o:p></o:p></p>
<p class=MsoPlainText>colocation d1 inf: dummy-1 dummy-main<o:p></o:p></p>
<p class=MsoPlainText>property $id="cib-bootstrap-options" \<o:p></o:p></p>
<p class=MsoPlainText>
dc-version="1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe" \<o:p></o:p></p>
<p class=MsoPlainText>
cluster-infrastructure="Heartbeat" \<o:p></o:p></p>
<p class=MsoPlainText>
stonith-enabled="false" \<o:p></o:p></p>
<p class=MsoPlainText>
last-lrm-refresh="1271853339"<o:p></o:p></p>
<p class=MsoPlainText><o:p> </o:p></p>
<p class=MsoPlainText>>Can you reproduce this easily?<o:p></o:p></p>
<p class=MsoPlainText>Not easily, but I finally have the correct
combination. In my case I have dependent resources, but I was able to
reproduce part of the issue using the Dummy resource.<o:p></o:p></p>
<p class=MsoPlainText>>Can you reproduce this with just a few
"Dummy" resources?<o:p></o:p></p>
<p class=MsoPlainText>I added an ocf_log call to the monitor action so I could tail
the messages file and see what was happening. I also created another resource agent,
“DummySleep”, with a sleep inserted into its start action as follows:<o:p></o:p></p>
<p class=MsoPlainText><o:p> </o:p></p>
<p class=MsoPlainText>dummy_start() {<o:p></o:p></p>
<p class=MsoPlainText> ocf_log info "OCF_RESKEY_state is
${OCF_RESKEY_state}"<o:p></o:p></p>
<p class=MsoPlainText> dummy_monitor<o:p></o:p></p>
<p class=MsoPlainText> ret=$?<o:p></o:p></p>
<p class=MsoPlainText> ocf_log info "dummy start
sleep..."<o:p></o:p></p>
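<p class=MsoPlainText> sleep 3000  # test hook: blocks for 50 minutes, far longer than the 2m start timeout<o:p></o:p></p>
<p class=MsoPlainText> return $OCF_ERR_GENERIC  # test hook: fail the start if the sleep ever completes<o:p></o:p></p>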
<p class=MsoPlainText> ocf_log info "dummy start sleep
return..."<o:p></o:p></p>
<p class=MsoPlainText> if [ $ret = $OCF_SUCCESS ]; then<o:p></o:p></p>
<p class=MsoPlainText> return
$OCF_SUCCESS<o:p></o:p></p>
<p class=MsoPlainText> fi<o:p></o:p></p>
<p class=MsoPlainText> touch ${OCF_RESKEY_state}<o:p></o:p></p>
<p class=MsoPlainText>}<o:p></o:p></p>
<p class=MsoPlainText><o:p> </o:p></p>
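<p class=MsoPlainText>For reference, the monitor logging mentioned above could look roughly like the sketch below. It is not the exact agent code: the ocf_log stand-in and the default values (including the state file path) are assumptions added so the function can be run outside a cluster. In a real agent they come from the OCF shell function library and the resource parameters. The two messages match the "dummy monitor" and "Not Running" lines in the log further down.<o:p></o:p></p>
<pre>
```shell
#!/bin/sh
# Sketch only, not the exact agent code.
# ASSUMPTION: these defaults stand in for values that the OCF shell function
# library and the resource parameters provide in a real agent.
: "${OCF_SUCCESS:=0}"
: "${OCF_NOT_RUNNING:=7}"
: "${OCF_RESKEY_state:=${TMPDIR:-/tmp}/Dummy-sketch.state}"

# ASSUMPTION: minimal stand-in for ocf_log so the sketch runs outside a cluster.
ocf_log() {
    level=$(echo "$1" | tr '[:lower:]' '[:upper:]')
    shift
    echo "$level: $*"
}

dummy_monitor() {
    ocf_log info "dummy monitor"
    if [ -f "${OCF_RESKEY_state}" ]; then
        return "$OCF_SUCCESS"
    fi
    # rc=7 is what crmd reports as "not running" in the log.
    ocf_log info "Not Running"
    return "$OCF_NOT_RUNNING"
}
```
</pre>
<p class=MsoPlainText>With the state file present, dummy_monitor returns 0. Once the file is removed it logs "Not Running" and returns 7.<o:p></o:p></p>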
<p class=MsoPlainText>I ran the test as follows:<o:p></o:p></p>
<p class=MsoPlainText style='margin-left:.5in;text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span
style='mso-list:Ignore'>1)<span style='font:7.0pt "Times New Roman"'> </span></span><![endif]>Commented
out the sleep and return to get the DummySleep resource going with the others<o:p></o:p></p>
<p class=MsoPlainText style='margin-left:.5in;text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span
style='mso-list:Ignore'>2)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;</span></span><![endif]>Replaced
the DummySleep OCF script with the version that has the sleep and return enabled<o:p></o:p></p>
<p class=MsoPlainText style='margin-left:.5in;text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span
style='mso-list:Ignore'>3)<span style='font:7.0pt "Times New Roman"'> </span></span><![endif]>Ran
crm resource stop dummy-sleep<o:p></o:p></p>
<p class=MsoPlainText style='margin-left:.5in;text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span
style='mso-list:Ignore'>4)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;</span></span><![endif]>Ran
crm resource start dummy-sleep so that its start action goes to sleep<o:p></o:p></p>
<p class=MsoPlainText style='margin-left:.5in;text-indent:-.25in;mso-list:l0 level1 lfo1'><![if !supportLists]><span
style='mso-list:Ignore'>5)<span style='font:7.0pt "Times New Roman"'>&nbsp;&nbsp;</span></span><![endif]>Removed
(rm) the state file for dummy-main to cause the monitor failure<o:p></o:p></p>
<p class=MsoPlainText style='margin-left:.5in'><o:p> </o:p></p>
<p class=MsoPlainText style='margin-left:.5in'><o:p> </o:p></p>
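<p class=MsoPlainText>Steps 3 to 5 above can be sketched as a small script. This is only an illustration: the run and reproduce helpers and the STATE_FILE default are hypothetical names and values, and the real state file location depends on the "state" parameter of the Dummy agent on your system. By default the script only prints the commands.<o:p></o:p></p>
<pre>
```shell
#!/bin/sh
# Sketch of steps 3 to 5. Prints the commands by default; set RUN=1 to
# execute them on a test cluster node.
# ASSUMPTION: this state file path is hypothetical. Check the "state"
# parameter of your Dummy agent for the real location.
STATE_FILE="${STATE_FILE:-/var/run/Dummy-dummy-main.state}"

# Hypothetical helper: execute the command only when RUN=1, else print it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

reproduce() {
    run crm resource stop dummy-sleep     # step 3
    run crm resource start dummy-sleep    # step 4: the start now blocks in sleep
    run rm -f "$STATE_FILE"               # step 5: dummy-main's monitor will fail
}

reproduce
```
</pre>
<p class=MsoPlainText>After step 5, tail the messages file and compare the "Not Running" lines for dummy-main with what crm_mon shows.<o:p></o:p></p>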
<p class=MsoPlainText>You’ll notice that the monitor for dummy-main keeps
running and reporting “Not Running”, but the resource is not restarted until
the dummy-sleep start action returns.<o:p></o:p></p>
<p class=MsoPlainText><o:p> </o:p></p>
<p class=MsoPlainText>Apr 21 10:06:56 qpr1 lrmd: [30826]: info: RA output:
(dummy-sleep:start:stderr) 2010/04/21_10:06:56 INFO: dummy start sleep...<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:01 qpr1 lrmd: [30826]: info: RA output:
(dummy-main:monitor:stderr) 2010/04/21_10:07:01 INFO: dummy monitor<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:26 qpr1 lrmd: [30826]: info: RA output:
(dummy-1:monitor:stderr) 2010/04/21_10:07:26 INFO: dummy monitor<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:31 qpr1 lrmd: [30826]: info: RA output:
(dummy-main:monitor:stderr) 2010/04/21_10:07:31 INFO: dummy monitor<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:31 qpr1 lrmd: [30826]: info: RA output:
(dummy-main:monitor:stderr) 2010/04/21_10:07:31 INFO: Not Running<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:31 qpr1 crmd: [30829]: info:
process_lrm_event: LRM operation dummy-main_monitor_30000 (call=135, rc=7,
cib-update=204, confirmed=false) not running<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:32 qpr1 attrd: [30828]: info:
attrd_ha_callback: Update relayed from qpr2<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:32 qpr1 attrd: [30828]: info:
attrd_local_callback: Expanded fail-count-dummy-main=value++ to 7<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:32 qpr1 attrd: [30828]: info:
attrd_trigger_update: Sending flush op to all hosts for: fail-count-dummy-main
(7)<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:32 qpr1 attrd: [30828]: info:
attrd_perform_update: Sent update 107: fail-count-dummy-main=7<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:32 qpr1 attrd: [30828]: info:
attrd_ha_callback: Update relayed from qpr2<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:32 qpr1 attrd: [30828]: info:
attrd_trigger_update: Sending flush op to all hosts for:
last-failure-dummy-main (1271858866)<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:32 qpr1 attrd: [30828]: info:
attrd_perform_update: Sent update 109: last-failure-dummy-main=1271858866<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:07:56 qpr1 lrmd: [30826]: info: RA output:
(dummy-1:monitor:stderr) 2010/04/21_10:07:56 INFO: dummy monitor<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output:
(dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: dummy monitor<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output:
(dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: Not Running<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:08:26 qpr1 lrmd: [30826]: info: RA output:
(dummy-1:monitor:stderr) 2010/04/21_10:08:26 INFO: dummy monitor<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:08:31 qpr1 lrmd: [30826]: info: RA output:
(dummy-main:monitor:stderr) 2010/04/21_10:08:31 INFO: dummy monitor<o:p></o:p></p>
<p class=MsoPlainText>Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output:
(dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: Not Running<o:p></o:p></p>
<p class=MsoPlainText><o:p> </o:p></p>
<p class=MsoNormal><o:p> </o:p></p>
</div>
</body>
</html>