<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">My apologies... we (me, Lars and Keisuke) discussed this at the cluster summit<span class="Apple-style-span" style="font-family: -webkit-monospace; "> and I was supposed to summarize the results (but I didn't find the time until now).</span><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><span class="Apple-style-span" style="font-family: -webkit-monospace; ">Essentially we decided that my idea, which you have implemented here, wouldn't work :-(</span></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><br></div><div><font class="Apple-style-span" face="-webkit-monospace">- If the initial request is lost due to congestion, then the loop will only be executed once</font></div><div><font class="Apple-style-span" face="-webkit-monospace"><span class="Apple-style-span" style="font-family: 'Lucida Grande'; "><div><font class="Apple-style-span" face="-webkit-monospace"> (Assuming the RA makes a request to a server/daemon as part of the resource's health check)</font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace"> This makes the loop no better than a single monitor operation with a long timeout.</font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div></span></font></div><div><font class="Apple-style-span" face="-webkit-monospace"><span class="Apple-style-span" style="font-family: 'Lucida Grande'; "><font class="Apple-style-span" face="-webkit-monospace">- Looping the monitor action as a whole (whether driven by the pengine, lrmd or RA) is not a good idea<br></font></span></font></div><div><span class="Apple-style-span" style="font-family: -webkit-monospace; "> - Re-executing the complete 
loop is inefficient.</span></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace"> For example, there is no need to re-check the contents of a PID or configuration file each time.</font></div><div><div><span class="Apple-style-span" style="font-family: -webkit-monospace; "> This indicates that any looping should occur within the monitor operation itself.</span></div><div></div></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace"> - It unnecessarily delays the cluster's recovery of some failures.</font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace"> For example, if the daemon's process doesn't exist, then no amount of looping will bring it back.</font></div><div><font class="Apple-style-span" face="-webkit-monospace"> In such cases, the RA should return immediately. 
However, the presence of a loop prevents this.</font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace"><span class="Apple-style-span" style="font-family: 'Lucida Grande'; "><div><span class="Apple-style-span" style="font-family: -webkit-monospace; ">- Lars also expressed the fear that others would enable this functionality for the wrong reasons and that the general quality of the monitor actions would decrease as a result.</span></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace">The most important part, though, is that because only parts of the monitor operation should be repeated (and only under some circumstances), the loop must be _inside_ the monitor operation.</font></div><div><br></div><div><font class="Apple-style-span" face="-webkit-monospace">This rules out crmd/PE/lrmd involvement and means that each RA requiring this functionality would need to be modified individually.</font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace">This is consistent with the idea that only the RA knows enough about the resource to know when it has truly failed, and that monitor must therefore do whatever it needs to do in order to return a definitive result.</font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><span class="Apple-style-span" style="font-family: -webkit-monospace; ">It might be necessary to write a small utility in C to assist the RA in running specific parts of the monitor action with a timeout; however, wget may be sufficient for the few resources that require this functionality (as it already allows 
the number of retries and timeouts to be specified).</span></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace">Please let me know if anything above was not clear.</font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div><font class="Apple-style-span" face="-webkit-monospace">Andrew</font></div></span></font></div><div><font class="Apple-style-span" face="-webkit-monospace"><br></font></div><div>On Oct 7, 2008, at 12:55 PM, Satomi TANIGUCHI wrote:</div><div><div><div><br class="Apple-interchange-newline"><blockquote type="cite">Hi,<br><br><br>I'm posting patches to add a "monitor-loop" operation.<br>The role of each patch is:<br>(1) monitor_loop_hb.patch: add ocf_monitor_loop() in .ocf-shellfuncs.<br> This is for Heartbeat(83a87f2b6554).<br>(2) monitor_loop_pm.patch: add "monitor-loop" operation to cib.<br> This is for Pacemaker(0f6fc6f8c01f).<br><br>1. Specifications<br>The monitor-loop operation calls the monitor op consecutively until:<br>(1) the monitor op returns a normal value (OCF_SUCCESS or OCF_RUNNING_MASTER), or<br>(2) the failure count reaches the threshold.<br><br>To set the threshold value, add a new attribute "maxfailures"<br>in each resource's &lt;instance_attributes&gt;.<br>If you don't set the threshold, or if you set it to zero,<br>the monitor-loop op does not return until a monitor op succeeds,<br>so a persistent failure ends in an operation timeout.<br><br>2. 
How to use<br>(1) Add the following line between "case $__OCF_ACTION in" and "esac"<br> in your RA.<br> monitor-loop) ocf_monitor_loop ${OCF_RESKEY_maxfailures};;<br> As an example, I attached a patch for the Dummy resource<br> (monitor_loop_Dummy.patch).<br>(2) Edit cib.xml.<br> Add "maxfailures" in &lt;instance_attributes&gt;, and add a "monitor-loop" operation<br> instead of a regular monitor op.<br> Example:<br> &lt;primitive id="prmDummy1" class="ocf" type="Dummy" provider="heartbeat"&gt;<br> &lt;instance_attributes id="prmDummy1-instance-attributes"&gt;<br> &lt;nvpair id="prmDummy1-instance-attrs-maxfailures" name="maxfailures" value="3"/&gt;<br> &lt;/instance_attributes&gt;<br> &lt;operations&gt;<br> &lt;op id="prmDummy1-operations-start" name="start" interval="0" timeout="300" on-fail="restart"/&gt;<br> &lt;op id="prmDummy1-operations-monitor-loop" name="monitor-loop" interval="10" timeout="60" on-fail="restart"/&gt;<br> &lt;op id="prmDummy1-operations-stop" name="stop" interval="0" timeout="300" on-fail="block"/&gt;<br> &lt;/operations&gt;<br> &lt;/primitive&gt;<br><br>3. 
NOTE<br>The monitor-loop operation is only for OCF resources, not for STONITH resources.<br><br><br>Thank you very much for your advice, Andrew and Lars!<br>With just a little alteration, I was able to implement what I had in mind.<br><br>Now I would like to hear your opinions.<br>For OCF resources, it's easy to add the monitor-loop operation thanks to<br>.ocf-shellfuncs.<br>But STONITH resources don't have any common file like that.<br>So, if I want to add a monitor-loop (or status-loop) operation to<br>STONITH resources, I have to add a function to each of them.<br>That is almost the same as modifying each of their status functions...<br><br>Setting the monitor-loop operation aside,<br>shouldn't STONITH resources have a common file like OCF resources do?<br><br><br>Your comments and suggestions are really appreciated.<br><br><br>Best Regards,<br>Satomi TANIGUCHI<br><br><br><br><br><br>Lars Marowsky-Bree wrote:<br><blockquote type="cite">On 2008-09-17T10:09:21, Andrew Beekhof &lt;<a href="mailto:beekhof@gmail.com">beekhof@gmail.com</a>&gt; wrote:<br></blockquote><blockquote type="cite"><blockquote type="cite">I can't help but feel this is all a work-around for badly written RAs and/or overly aggressive timeouts. There's nothing wrong with setting large timeouts... if you set 1 hour and the op returns in 1 second, then we don't wait around doing nothing for the other 59 minutes and 59 seconds.<br></blockquote></blockquote><blockquote type="cite">Agreed. RAs shouldn't fail randomly. 
RAs are considered part of the<br></blockquote><blockquote type="cite">"trusted" infrastructure.<br></blockquote><blockquote type="cite"><blockquote type="cite">But if you really really only want to report an error if N monitors fail in M seconds (I still think this is crazy, but whatever), then simply implement monitor_loop() which calls monitor() up to N times looking for $OCF_SUCCESS and add:<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"> <op id=... name="monitor_loop" timeout="M" interval=... /><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">instead of a regular monitor op. Or even in addition to a regular monitor op with on_fail=ignore if you want.<br></blockquote></blockquote><blockquote type="cite">Best idea so far.<br></blockquote><blockquote type="cite">Regards,<br></blockquote><blockquote type="cite"> Lars<br></blockquote><br>diff -r 83a87f2b6554 resources/OCF/.ocf-shellfuncs.in<br>--- a/resources/OCF/.ocf-shellfuncs.in<span class="Apple-tab-span" style="white-space:pre"> </span>Sat Oct 04 15:54:26 2008 +0200<br>+++ b/resources/OCF/.ocf-shellfuncs.in<span class="Apple-tab-span" style="white-space:pre"> </span>Tue Oct 07 17:43:38 2008 +0900<br>@@ -234,4 +234,35 @@<br> trap "rm -f $lockfile" EXIT<br> }<br><br>+ocf_monitor_loop() {<br>+ local max=0<br>+ local cnt=0<br>+ <br>+ if [ -n "$1" ]; then<br>+ max=$1<br>+ fi<br>+<br>+ if [ ${max} -lt 0 ]; then<br>+ ocf_log error "ocf_monitor_loop: ${OCF_RESOURCE_INSTANCE}: maxfailures has invalid value ${max}."<br>+ max=0<br>+ fi<br>+<br>+ while :<br>+ do<br>+ $0 monitor<br>+ ret=$?<br>+ ocf_log debug "ocf_monitor_loop: ${OCF_RESOURCE_INSTANCE}: monitor's return code is ${ret}."<br>+<br>+ if [ ${ret} -eq $OCF_SUCCESS -o ${ret} -eq $OCF_RUNNING_MASTER ]; then<br>+ break<br>+ fi<br>+ cnt=`expr ${cnt} + 
1`<br>+ ocf_log warn "ocf_monitor_loop: ${OCF_RESOURCE_INSTANCE}: monitor is failed ${cnt} times."<br>+<br>+ if [ ${max} -gt 0 -a ${cnt} -ge ${max} ]; then<br>+ break<br>+ fi<br>+ done<br>+ return ${ret}<br>+}<br> __ocf_set_defaults "$@"<br>diff -r 0f6fc6f8c01f include/crm/crm.h<br>--- a/include/crm/crm.h<span class="Apple-tab-span" style="white-space:pre"> </span>Mon Oct 06 18:27:13 2008 +0200<br>+++ b/include/crm/crm.h<span class="Apple-tab-span" style="white-space:pre"> </span>Tue Oct 07 17:43:57 2008 +0900<br>@@ -190,6 +190,7 @@<br> #define CRMD_ACTION_NOTIFIED<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>"notified"<br><br> #define CRMD_ACTION_STATUS<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>"monitor"<br>+#define CRMD_ACTION_STATUS_LOOP<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>"monitor-loop"<br><br> /* short names */<br> #define RSC_DELETE<span class="Apple-tab-span" style="white-space:pre"> </span>CRMD_ACTION_DELETE<br>diff -r 0f6fc6f8c01f include/crm/pengine/common.h<br>--- a/include/crm/pengine/common.h<span class="Apple-tab-span" style="white-space:pre"> </span>Mon Oct 06 18:27:13 2008 +0200<br>+++ b/include/crm/pengine/common.h<span class="Apple-tab-span" style="white-space:pre"> </span>Tue Oct 07 17:43:57 2008 +0900<br>@@ -52,7 +52,8 @@<br> <span class="Apple-tab-span" style="white-space:pre"> </span>action_demote,<br> <span class="Apple-tab-span" style="white-space:pre"> </span>action_demoted,<br> <span class="Apple-tab-span" style="white-space:pre"> </span>shutdown_crm,<br>-<span class="Apple-tab-span" style="white-space:pre"> </span>stonith_node<br>+<span class="Apple-tab-span" style="white-space:pre"> </span>stonith_node,<br>+<span class="Apple-tab-span" style="white-space:pre"> </span>monitor_loop_rsc<br> };<br><br> 
enum rsc_recovery_type {<br>diff -r 0f6fc6f8c01f lib/pengine/common.c<br>--- a/lib/pengine/common.c<span class="Apple-tab-span" style="white-space:pre"> </span>Mon Oct 06 18:27:13 2008 +0200<br>+++ b/lib/pengine/common.c<span class="Apple-tab-span" style="white-space:pre"> </span>Tue Oct 07 17:43:57 2008 +0900<br>@@ -212,6 +212,8 @@<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>return no_action;<span class="Apple-tab-span" style="white-space:pre"> </span><br> <span class="Apple-tab-span" style="white-space:pre"> </span>} else if(safe_str_eq(task, "all_stopped")) {<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>return no_action;<span class="Apple-tab-span" style="white-space:pre"> </span><br>+<span class="Apple-tab-span" style="white-space:pre"> </span>} else if(safe_str_eq(task, CRMD_ACTION_STATUS_LOOP)) {<br>+<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>return monitor_loop_rsc;<span class="Apple-tab-span" style="white-space:pre"> </span><br> <span class="Apple-tab-span" style="white-space:pre"> </span>} <br> <span class="Apple-tab-span" style="white-space:pre"> </span>crm_debug("Unsupported action: %s", task);<br> <span class="Apple-tab-span" style="white-space:pre"> </span>return no_action;<br>@@ -265,6 +267,9 @@<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>break;<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case action_demoted:<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> 
</span>result = CRMD_ACTION_DEMOTED;<br>+<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>break;<br>+<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case monitor_loop_rsc:<br>+<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>result = CRMD_ACTION_STATUS_LOOP;<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>break;<br> <span class="Apple-tab-span" style="white-space:pre"> </span>}<br> <span class="Apple-tab-span" style="white-space:pre"> </span><br>diff -r 0f6fc6f8c01f pengine/group.c<br>--- a/pengine/group.c<span class="Apple-tab-span" style="white-space:pre"> </span>Mon Oct 06 18:27:13 2008 +0200<br>+++ b/pengine/group.c<span class="Apple-tab-span" style="white-space:pre"> </span>Tue Oct 07 17:43:57 2008 +0900<br>@@ -431,6 +431,7 @@<br> <span class="Apple-tab-span" style="white-space:pre"> </span> switch(task) {<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case no_action:<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case monitor_rsc:<br>+<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case monitor_loop_rsc:<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case action_notify:<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" 
style="white-space:pre"> </span>case action_notified:<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case shutdown_crm:<br>diff -r 0f6fc6f8c01f pengine/utils.c<br>--- a/pengine/utils.c<span class="Apple-tab-span" style="white-space:pre"> </span>Mon Oct 06 18:27:13 2008 +0200<br>+++ b/pengine/utils.c<span class="Apple-tab-span" style="white-space:pre"> </span>Tue Oct 07 17:43:57 2008 +0900<br>@@ -335,6 +335,7 @@<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>task--;<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>break;<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case monitor_rsc:<br>+<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case monitor_loop_rsc:<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case shutdown_crm:<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>case stonith_node:<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>task = no_action;<br>diff -r 83a87f2b6554 resources/OCF/Dummy<br>--- a/resources/OCF/Dummy<span class="Apple-tab-span" style="white-space:pre"> </span>Sat Oct 04 15:54:26 2008 +0200<br>+++ b/resources/OCF/Dummy<span class="Apple-tab-span" style="white-space:pre"> </span>Tue Oct 07 19:11:31 2008 +0900<br>@@ -142,6 +142,7 @@<br> start)<span 
class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>dummy_start;;<br> stop)<span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>dummy_stop;;<br> monitor)<span class="Apple-tab-span" style="white-space:pre"> </span>dummy_monitor;;<br>+monitor-loop)<span class="Apple-tab-span" style="white-space:pre"> </span>ocf_monitor_loop ${OCF_RESKEY_maxfailures};;<br> migrate_to)<span class="Apple-tab-span" style="white-space:pre"> </span>ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrate_to}."<br> <span class="Apple-tab-span" style="white-space:pre"> </span> dummy_stop<br> <span class="Apple-tab-span" style="white-space:pre"> </span><span class="Apple-tab-span" style="white-space:pre"> </span>;;<br>_______________________________________________<br>Pacemaker mailing list<br><a href="mailto:Pacemaker@clusterlabs.org">Pacemaker@clusterlabs.org</a><br>http://list.clusterlabs.org/mailman/listinfo/pacemaker<br></blockquote></div><br></div></div></body></html>