<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Sorry for the double post. I find that pinging the network gateway (192.168.1.1) actually works better. Otherwise, the nodes will have equal pingd scores, since the pingd resource is cloned.<div><br><div><div>On May 18, 2011, at 2:45 PM, Daniel Bozeman wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Here is my solution for others to reference. It may not be ideal or possible for everyone, and I am open to suggestions.<div><br></div><div>I've got two machines connected via crossover (will be two crossovers for redundancy in production) with static IPs. Corosync communicates over this network. Each machine is also connected to the main network (.1.77 and .1.78).</div><div><br></div><div>This way, the machines can continue to communicate with one another despite a network failure affecting one machine and react appropriately.</div><div><br></div><div>Using postgres as a test resource, I have the following (desired) results:</div><div><br></div><div>The primary node loses network connectivity and postgres is fired up on the other. 
When the former primary node regains connectivity, the process does not fail back, nor does it restart.</div><div><br></div><div>Please see my configuration below:</div><div><br></div><div><div>node postmaster</div><div>node postslave</div><div>primitive pingd ocf:pacemaker:pingd \</div><div> params host_list="192.168.1.77 192.168.1.78" multiplier="100" \</div><div> op monitor interval="15s" timeout="5s"</div><div>primitive postgres lsb:postgresql \</div><div> op monitor interval="20s"</div><div>clone pingdclone pingd \</div><div> meta globally-unique="false"</div><div>location postgres_location postgres \</div><div> rule $id="postgres_location-rule" pingd: defined pingd</div><div>property $id="cib-bootstrap-options" \</div><div> dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \</div><div> cluster-infrastructure="openais" \</div><div> expected-quorum-votes="2" \</div><div> stonith-enabled="false" \</div><div> no-quorum-policy="ignore" \</div><div> last-lrm-refresh="1305736421"</div></div><div><br></div><div>Naturally, this is a very simple configuration that only tests network failure failover and failback prevention.</div><div><br></div><div>Are there any downsides to my method? I'd love to hear feedback. Thank you all for your help. By the way, "on-fail=standby" did absolutely nothing for me.</div><div><br><div><div>On May 18, 2011, at 9:16 AM, Daniel Bozeman wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">I was originally using heartbeat with the config I mentioned in my first post, but for troubleshooting I moved on to a config identical to the one in the documentation.<div><br></div><div>Why is "on-fail=standby" not optimal? I have tried this in the past but it did not help. 
As far as I can tell, pacemaker does not consider a loss of network connectivity a failure on the part of the server itself or any of its resources. As I've said, everything works fine should I kill a process, kill corosync, etc.</div><div><br></div><div>I think this may be what I am looking for:</div><div><br></div><div><a href="http://www.clusterlabs.org/wiki/Example_configurations#Set_up_pingd">http://www.clusterlabs.org/wiki/Example_configurations#Set_up_pingd</a></div><div><br></div><div>But I am still having issues. How can I reset the scores once the node has been recovered? Is there some sort of "score reset" command? Once the node is set to -INF as this example shows, nothing is going to return to it.</div><div><br></div><div>Thank you all for your help</div><div><br><div><div>On May 18, 2011, at 4:02 AM, Dan Frincu wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">Hi,<br><br><div class="gmail_quote">On Wed, May 18, 2011 at 11:30 AM, Max Williams <span dir="ltr"><<a href="mailto:Max.Williams@betfair.com">Max.Williams@betfair.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div lang="EN-GB" link="blue" vlink="purple"><div><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Hi Daniel,</span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">You might want to set “on-fail=standby” for the resource group or individual resources. This will put the host into standby when a failure occurs, thus preventing failback:</span></p>
</div></div></blockquote><div><br></div><div>This is not the most optimal solution.</div><div> </div><blockquote class="gmail_quote" style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0.8ex; border-left-width: 1px; border-left-color: rgb(204, 204, 204); border-left-style: solid; padding-left: 1ex; position: static; z-index: auto; "><div lang="EN-GB" link="blue" vlink="purple">
<div><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"><a href="http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-operations.html#s-resource-failure" target="_blank">http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-operations.html#s-resource-failure</a></span></p><div><span style="font-size:11.0pt;color:#1F497D"> </span><br class="webkit-block-placeholder"></div><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Another option is to set resource stickiness which will stop resources moving back after a failure:</span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"><a href="http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05s03s02.html" target="_blank">http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05s03s02.html</a></span></p>
</div></div></blockquote><div><br></div><div>That is set globally in his config.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div lang="EN-GB" link="blue" vlink="purple">
<div><div><span style="font-size:11.0pt;color:#1F497D"> </span><br class="webkit-block-placeholder"></div><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Also note that if you are using a two-node cluster, you will also need to set the property “no-quorum-policy=ignore”.</span></p>
</div></div></blockquote><div><br></div><div>This as well.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div lang="EN-GB" link="blue" vlink="purple">
<div><div><span style="font-size:11.0pt;color:#1F497D"> </span><br class="webkit-block-placeholder"></div><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Hope that helps!</span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Cheers,</span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Max</span></p><div><span style="font-size:11.0pt;color:#1F497D"> </span><br class="webkit-block-placeholder"></div><div><div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm"><p class="MsoNormal"><b><span lang="EN-US" style="font-size:10.0pt">From:</span></b><span lang="EN-US" style="font-size:10.0pt"> Daniel Bozeman [mailto:<a href="mailto:daniel.bozeman@americanroamer.com" target="_blank">daniel.bozeman@americanroamer.com</a>] <br>
<b>Sent:</b> 17 May 2011 19:09<br><b>To:</b> <a href="mailto:pacemaker@oss.clusterlabs.org" target="_blank">pacemaker@oss.clusterlabs.org</a><br><b>Subject:</b> Re: [Pacemaker] Preventing auto-fail-back</span></p></div></div>
<div><div></div><div class="h5"><div> <br class="webkit-block-placeholder"></div><p class="MsoNormal">To be more specific:</p><div><div> <br class="webkit-block-placeholder"></div></div><div><p class="MsoNormal">I've tried following the example on pages 25 and 26 of this document to the letter: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a></p>
</div></div></div></div></div></blockquote><div><br></div><div>Well, not really, that's why there are errors in your config.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div lang="EN-GB" link="blue" vlink="purple"><div><div><div class="h5"><div><div> <br class="webkit-block-placeholder"></div></div><div><p class="MsoNormal">And it does work as advertised. When I stop corosync, the resource goes to the other node. I start corosync and it remains there as it should.</p>
</div><div><div> <br class="webkit-block-placeholder"></div></div><div><p class="MsoNormal">However, if I simply unplug the ethernet connection, let the resource migrate, then plug it back in, it will fail back to the original node. Is this the intended behavior? It seems a bad NIC could wreak havoc on such a setup.</p>
</div><div><div> <br class="webkit-block-placeholder"></div></div><div><p class="MsoNormal">Thanks!</p></div><div><div> <br class="webkit-block-placeholder"></div></div><div><p class="MsoNormal">Daniel</p></div><div><div> <br class="webkit-block-placeholder"></div></div><div><div><p class="MsoNormal">
On May 16, 2011, at 5:33 PM, Daniel Bozeman wrote:</p></div><p class="MsoNormal"><br><br></p><div><p class="MsoNormal">For the life of me, I cannot prevent auto-failback from occurring in a master-slave setup I have in virtual machines. I have a very simple configuration:<br>
<br>node $id="4fe75075-333c-4614-8a8a-87149c7c9fbb" ha2 \<br> attributes standby="off"<br>node $id="70718968-41b5-4aee-ace1-431b5b65fd52" ha1 \<br> attributes standby="off"<br>
primitive FAILOVER-IP ocf:heartbeat:IPaddr \<br> params ip="192.168.1.79" \<br> op monitor interval="10s"<br>primitive PGPOOL lsb:pgpool2 \<br> op monitor interval="10s"<br>
group PGPOOL-AND-IP FAILOVER-IP PGPOOL<br>colocation IP-WITH-PGPOOL inf: FAILOVER-IP PGPOOL<br>property $id="cib-bootstrap-options" \<br> dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \<br>
</p></div></div></div></div></div></div></blockquote><div><br></div><div>Change to cluster-infrastructure="openais"</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div lang="EN-GB" link="blue" vlink="purple"><div><div><div class="h5"><div><div><p class="MsoNormal"> cluster-infrastructure="Heartbeat" \<br> stonith-enabled="false" \<br> no-quorum-policy="ignore"<br>
</p></div></div></div></div></div></div></blockquote><div><br></div><div>You're missing expected-quorum-votes here; it should be expected-quorum-votes="2". It's usually added automatically when the nodes are added to or seen by the cluster, so I assume its absence is related to cluster-infrastructure="Heartbeat".</div>
<div><br></div><div>Regards,</div><div>Dan</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div lang="EN-GB" link="blue" vlink="purple"><div><div><div class="h5">
<div><div><p class="MsoNormal">rsc_defaults $id="rsc-options" \<br> resource-stickiness="1000"<br><br>No matter what I do with resource stickiness, I cannot prevent fail-back. I usually don't have a problem with failback when I restart the current master, but when I disable network connectivity to the master, everything fails over fine. Then I enable the network adapter and everything jumps back to the original "failed" node. I've done some "watch ptest -Ls"ing, and the scores seem to signify that failback should not occur. I'm also seeing resources bounce more times than necessary when a node is added (~3 times each) and resources seem to bounce when a node returns to the cluster even if it isn't necessary for them to do so. I also had an order directive in my configuration at one time, and often the second resource would start, then stop, then allow the first resource to start, then start itself. Quite weird. Any nods in the right direction would be greatly appreciated. I've scoured Google and read the official documentation to no avail. I suppose I should mention I am using heartbeat as well. My LSB resource implements start/stop/status properly without error.<br>
<br>I've been testing this with a floating IP + Postgres as well with the same issues. One thing I notice is that my "group" resources have no score. Is this normal? There doesn't seem to be any way to assign a stickiness to a group, and default stickiness has no effect.<br>
<br>Thanks!<br><br>Daniel Bozeman</p></div></div><div> <br class="webkit-block-placeholder"></div><div><p class="MsoNormal"><span><span style="font-size:13.5pt;color:black">Daniel Bozeman</span></span><span style="font-size:13.5pt;color:black"><br>
<span>American Roamer</span><br><span>Systems Administrator</span><br><span><a href="mailto:daniel.bozeman@americanroamer.com" target="_blank">daniel.bozeman@americanroamer.com</a></span></span> </p></div><div>
<br class="webkit-block-placeholder"></div></div></div></div><br>
</div><br>_______________________________________________<br>
Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>
<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org/" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker" target="_blank">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a><br>
<br></blockquote></div><br><br clear="all"><br>-- <br>Dan Frincu<div>CCNA, RHCE</div><br>
</blockquote></div><br><div>
Daniel Bozeman<br>American Roamer<br>Systems Administrator<br><a href="mailto:daniel.bozeman@americanroamer.com">daniel.bozeman@americanroamer.com</a>
</div>
<br></div></div></blockquote></div><br><div>
Daniel Bozeman<br>American Roamer<br>Systems Administrator<br><a href="mailto:daniel.bozeman@americanroamer.com">daniel.bozeman@americanroamer.com</a>
</div>
<br></div></div></blockquote></div><br><div>
<span class="Apple-style-span" style="border-collapse: separate; color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-size: medium; ">Daniel Bozeman<br>American Roamer<br>Systems Administrator<br><a href="mailto:daniel.bozeman@americanroamer.com">daniel.bozeman@americanroamer.com</a></span>
</div>
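<div><br></div><div>A sketch of the gateway variant mentioned at the top of this message, assuming the rest of the configuration quoted above stays unchanged. Only host_list differs from the posted pingd primitive; 192.168.1.1 is the gateway address given above, and the resulting scores have not been verified here:</div><div><br></div><div>primitive pingd ocf:pacemaker:pingd \</div><div>&nbsp;params host_list="192.168.1.1" multiplier="100" \</div><div>&nbsp;op monitor interval="15s" timeout="5s"</div><div><br></div><div>With a single ping target and multiplier="100", a node that reaches the gateway should get a pingd value of 100 and one that cannot should get 0, which is what lets the postgres_location rule tell the two nodes apart.</div>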
<br></div></body></html>