Hello,<br><br><div class="gmail_quote">On Mon, Aug 22, 2011 at 2:55 AM, ihjaz Mohamed <span dir="ltr"><<a href="mailto:ihjazmohamed@yahoo.co.in">ihjazmohamed@yahoo.co.in</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<table cellspacing="0" cellpadding="0" border="0"><tbody><tr><td valign="top" style="font:inherit">Hi,<br><br>Has anyone here come across this issue?<br><br></td></tr></tbody></table></blockquote><div><br></div><div><br>
</div><div>Sorry for the delay, but I wanted to respond and let you know that I'm having this issue too. I can fairly reliably kill a simple two-node cluster by rebooting one of the nodes. When the rebooted node comes back up and starts pacemaker, it immediately tries to start all services on itself, even though they are running happily and healthily on the other node and resource-stickiness is set to 1000. The result is that none of the resources run anywhere: they become unmanaged, and crm status shows them as running on the freshly rebooted node. If pacemaker could be configured to wait on startup before it tries to run services, I think even 5 seconds would be enough for it to realize that it should not start anything at all. I haven't been able to find a setting that accomplishes that, though.</div>
<div><br></div><div>The cluster is a simple one, set up to test the VirtualDomain RA, which in and of itself has given me fits (empty state files: why do they get emptied rather than removed? An empty state file prevents the VM from starting until you manually re-populate it, no matter how many 'resource cleanup' attempts you make), but that is for another troubleshooting session. This problem is my biggie, because a healthy surviving node has all of its resources forced off and killed by a rebooted one.</div>
<div><br></div><div>Has anybody else been running into this, or are we just two unlucky fellas?</div><div><br></div><div>This is currently on CentOS 6.0 with all updates (I had the same issue on Scientific Linux 6.1, so I rolled back to CentOS for consistency, since all the other machines here run it). Both 'cman' and 'pacemaker' are configured to start at boot. I'll append cluster.conf and the 'crm configure show' output to the end of this message in case it helps someone spot a glaring mistake on my part (which I'd love it to be at this point, as that is easily fixed).</div>
<div><br></div><div>Regards,</div><div>Mark</div><div><br></div><div><br></div><div><div><?xml version="1.0"?></div><div><cluster config_version="1" name="KVMCluster"></div><div> <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/></div>
<div> <clusternodes></div><div> <clusternode name="kvm1" nodeid="1" votes="1"></div><div> <fence/></div><div> </clusternode></div>
<div> <clusternode name="kvm2" nodeid="2" votes="1"></div><div> <fence/></div><div> </clusternode></div><div> </clusternodes></div>
<div> <cman/></div><div> <fencedevices/></div><div> <rm/></div><div></cluster></div></div><div><br></div><div><br></div><div><div>node kvm1</div><div>node kvm2</div><div>primitive apache1 ocf:heartbeat:VirtualDomain \</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>params config="/etc/libvirt/qemu/apache1.xml" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>meta allow-migrate="true" is-managed="true" \</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>op start interval="0" timeout="90" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>op stop interval="0" timeout="90" \</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>op migrate_to interval="0" timeout="120" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>op migrate_from interval="0" timeout="60"</div>
<div>primitive fw1 ocf:heartbeat:VirtualDomain \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>params config="/etc/libvirt/qemu/fw1.xml" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>meta allow-migrate="true" is-managed="true" \</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>op start interval="0" timeout="90" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>op stop interval="0" timeout="90" \</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>op migrate_to interval="0" timeout="120" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>op migrate_from interval="0" timeout="60"</div>
<div>primitive vgClusterDisk ocf:heartbeat:LVM \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>params volgrpname="vgClusterDisk" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>op start interval="0" timeout="30" \</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>op stop interval="0" timeout="120"</div><div>clone shared_volgrp vgClusterDisk \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>meta target-role="Started" is-managed="true"</div>
<div>order storage_then_VMs inf: shared_volgrp ( fw1 apache1 )</div><div>property $id="cib-bootstrap-options" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>dc-version="1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe" \</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>cluster-infrastructure="cman" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>no-quorum-policy="ignore" \</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>stonith-action="reboot" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>stonith-timeout="30s" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>maintenance-mode="false" \</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>pe-error-series-max="5000" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>pe-warn-series-max="5000" \</div><div>
<span class="Apple-tab-span" style="white-space:pre"> </span>pe-input-series-max="5000" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>dc-deadtime="2min" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>stonith-enabled="false" \</div>
<div><span class="Apple-tab-span" style="white-space:pre"> </span>last-lrm-refresh="1314294568"</div><div>rsc_defaults $id="rsc-options" \</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>resource-stickiness="1000"</div>
</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><table cellspacing="0" cellpadding="0" border="0"><tbody><tr><td valign="top" style="font:inherit">
--- On <b>Wed, 17/8/11, ihjaz Mohamed <i><<a href="mailto:ihjazmohamed@yahoo.co.in" target="_blank">ihjazmohamed@yahoo.co.in</a>></i></b> wrote:<br><blockquote style="border-left:2px solid rgb(16, 16, 255);margin-left:5px;padding-left:5px">
<br>From: ihjaz Mohamed <<a href="mailto:ihjazmohamed@yahoo.co.in" target="_blank">ihjazmohamed@yahoo.co.in</a>><br>Subject: [Pacemaker] How to prevent a node that joins the cluster after reboot from starting the resources.<br>
To: <a href="mailto:pacemaker@oss.clusterlabs.org" target="_blank">pacemaker@oss.clusterlabs.org</a><br>Date: Wednesday, 17 August, 2011, 12:23 PM<br><br><div><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="font:inherit" valign="top">
<div><table border="0" cellpadding="0" cellspacing="0"><tbody><tr><td style="font-style:inherit;font-variant:inherit;font-weight:inherit;line-height:inherit;font-size-adjust:inherit;font-stretch:inherit;font-family:arial;font-size:10pt">
<span style="margin:1em 0pt 0pt"><span><span><span><span style="border-left-width:3px"><span style="display:block">Hi All,
<br><br>I am getting an unmanaged error, as shown below, when one of the nodes is rebooted
and comes back to join the cluster. <br><br><i>Online: [ <a href="http://aceblr101.com" target="_blank">aceblr101.com</a> <a href="http://aceblr107.com" target="_blank">aceblr107.com</a> ]
<br><br> Resource Group: HAService
<br> FloatingIP (ocf::heartbeat:IPaddr2): Started <a href="http://aceblr107.com" target="_blank">aceblr107.com</a> (unmanaged) FAILED
<br> acestatus (lsb:acestatus): Stopped
<br> Clone Set: pingdclone
<br> Started: [ <a href="http://aceblr101.com" target="_blank">aceblr101.com</a> <a href="http://aceblr107.com" target="_blank">aceblr107.com</a> ]
<br><br>Failed actions:
<br> FloatingIP_stop_0 (node=<a href="http://aceblr107.com" target="_blank">aceblr107.com</a>, call=7, rc=1, status=complete): unknown error<br></i></span></span></span></span></span></span><span style="margin:1em 0pt 0pt"><span><span><span><span style="border-left-width:3px"><span style="display:block"><span style="margin:1em 0pt 0pt"><span><span><span><span style="border-left-width:3px"><span style="display:block">Below is my configuration<span style="font-style:italic">:</span></span></span></span></span></span></span><i><span style="margin:1em 0pt 0pt"><span><span><span><span style="border-left-width:3px"><span style="display:block">node $id="8bf8e613-f63c-43a6-8915-4b2dbf72a4a5" <a href="http://aceblr101.com" target="_blank">aceblr101.com</a>
<br>node $id="bde62a1f-0f29-4357-a988-0e26bb06c4fb" <a href="http://aceblr107.com" target="_blank">aceblr107.com</a>
<br>primitive FloatingIP ocf:heartbeat:IPaddr2 \
<br> params ip="xx.xxx.xxx.xxx" nic="eth0:0"
<br>primitive acestatus lsb:acestatus \
<br> op start interval="30"
<br>primitive pingd ocf:pacemaker:pingd \
<br> params host_list="xx.xxx.xxx.1" multiplier="100" \
<br> op monitor interval="15s" timeout="5s"
<br>group HAService FloatingIP acestatus \
<br> meta target-role="Started"
<br>clone pingdclone pingd \
<br> meta globally-unique="false"
<br>location ip1_location FloatingIP \
<br> rule $id="ip1_location-rule" pingd: defined pingd
<br>property $id="cib-bootstrap-options" \
<br> dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
<br> cluster-infrastructure="Heartbeat" \
<br> expected-quorum-votes="2" \
<br> stonith-enabled="false" \
<br> no-quorum-policy="ignore" \
<br> last-lrm-refresh="1305736421"</span></span></span></span></span></span></i></span></span></span></span></span></span><span style="margin:1em 0pt 0pt"><span><span><span><span style="border-left-width:3px"><span style="display:block"><span style="margin:1em 0pt 0pt"><span><span><span><span style="border-left-width:3px"><span style="display:block"><span style="margin:1em 0pt 0pt"><span><span><span><span style="border-left-width:3px"><span style="display:block">I
see from the logs that when the rebooted node comes back and joins the
cluster, the resources on that node are started even though the
resources are already running on the existing node.
<br><br>Once the resources are running on both nodes, the cluster tries to stop them on one of the nodes; the stop fails and the resource goes into unmanaged mode.
<br><br>Could anyone tell me how to configure the cluster so that a
resource is not started on a node that rejoins the cluster after a
reboot when it is already started on the existing node.<br></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></span></td></tr></tbody></table></div></td>
</tr></tbody></table></div><br><div>_______________________________________________<br>Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org" target="_blank">Pacemaker@oss.clusterlabs.org</a><br>
<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br><br>Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>Bugs: <a href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker" target="_blank">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a><br>
</div></blockquote></td></tr></tbody></table><br>
<br></blockquote></div><br>