Hi,<br><br><div class="gmail_quote">On Mon, Oct 24, 2011 at 9:52 AM, Alan Robertson <span dir="ltr"><alanr@unix.sh></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


  <div text="#000000" bgcolor="#ffffff">

    Setting no-quorum-policy to ignore and disabling stonith is not a

    good idea.  You're sort of inviting the cluster to do screwed up

    things.<div><div></div><div class="h5"><br>

    <br></div></div></div></blockquote><div><br></div><div>Isn't no-quorum-policy="ignore" more or less required for a two-node cluster?  When one of the two nodes goes offline, the survivor no longer has quorum, and with the default policy ("stop") all services stop, which is definitely not what you want.  You can use "freeze" instead, but then the resources from the downed node don't get started on the surviving one.</div>
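<div><br></div><div>For reference, a minimal sketch of how the two settings mentioned above would be applied with the crm shell. It assumes the crm shell is installed and is run as root on one cluster node. Only one of the two commands would be used, and neither is meant as a recommendation for any particular cluster.</div>
<pre># Keep managing resources even when the partition has lost quorum
# (the setting used in the configuration quoted below).
crm configure property no-quorum-policy="ignore"

# Alternative: leave already-running resources where they are, but start
# nothing new in a partition that has no quorum.
crm configure property no-quorum-policy="freeze"</pre>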

<div><br></div><div>The problem he's running into sounds like one I posted a question about a while back: a node returning to the cluster doesn't wait to see whether services are running elsewhere.  Instead, it tries to start all services on itself the second corosync launches, even though they're already started, which leads to what his output shows: services in a started/unmanaged state.  I had this while running CentOS 6 and Scientific Linux 6.1 with pretty much stock corosync.conf files (just adjusted for network addresses).  I rebuilt the nodes with Debian for other reasons (Xen support and familiarity), and as a nice side effect the problem disappeared.</div>

<div><br></div><div>Mark</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><div text="#000000" bgcolor="#ffffff"><div><div class="h5">

    <br>

    On 10/24/2011 08:23 AM, ihjaz Mohamed wrote:

    </div></div><blockquote type="cite"><div><div></div><div class="h5">

      <div style="color:rgb(0, 0, 0);background-color:rgb(255, 255, 255);font-family:verdana,helvetica,sans-serif;font-size:14pt">

        <div><font size="3">Hi All,</font></div>

        <div><font size="2"><br>

          </font></div>

        <div><font size="3">I have Pacemaker running with Corosync.

            The following is my </font><font size="3">CRM configuration.</font></div>

        <div><br>

        </div>

        <div> <font size="2">node soalaba56<br>

            node soalaba63<br>

            primitive FloatingIP ocf:heartbeat:IPaddr2 \<br>

                    params ip="<floating_ip>" nic="eth0:0"<br>

            primitive acestatus lsb:acestatus<br>

            primitive pingd ocf:pacemaker:ping \<br>

                    params host_list="<gateway_ip>"

            multiplier="100" \<br>

                    op monitor interval="15s" timeout="5s"<br>

            group HAService FloatingIP acestatus \<br>

                    meta target-role="Started"<br>

            clone pingdclone pingd \<br>

                    meta globally-unique="false"<br>

            location ip1_location FloatingIP \<br>

                    rule $id="ip1_location-rule" pingd: defined pingd<br>

            property $id="cib-bootstrap-options" \<br>

            dc-version="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f"

            \<br>

                    cluster-infrastructure="openais" \<br>

                    expected-quorum-votes="2" \<br>

                    stonith-enabled="false" \<br>

                    no-quorum-policy="ignore" \<br>

                    last-lrm-refresh="1305736421"</font></div>

        <div><font size="2">----------------------------------------------------------------------</font></div>

        <div><br>

        </div>

        <div><font size="3">When I reboot both nodes at the same time,

            the cluster goes into an (unmanaged) FAILED state, as shown

            below.</font></div>

        <div><font size="2"><br>

          </font></div>

        <div><br>

        </div>

        <div><font size="2">============<br>

            Last updated: Mon Oct 24 08:10:42 2011<br>

            Stack: openais<br>

            Current DC: soalaba63 - partition with quorum<br>

            Version:

            1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f<br>

            2 Nodes configured, 2 expected votes<br>

            2 Resources configured.<br>

            ============<br>

            <br>

            Online: [ soalaba56 soalaba63 ]<br>

            <br>

             Resource Group: HAService<br>

                 FloatingIP (ocf::heartbeat:IPaddr2) Started 

            (unmanaged) FAILED[   soalaba63       soalaba56 ]<br>

                 acestatus  (lsb:acestatus):        Stopped<br>

             Clone Set: pingdclone [pingd]<br>

                 Started: [ soalaba56 soalaba63 ]<br>

            <br>

            Failed actions:<br>

                FloatingIP_stop_0 (node=soalaba63, call=7, rc=1,

            status=complete): unknown error<br>

                FloatingIP_stop_0 (node=soalaba56, call=7, rc=1,

            status=complete): unknown error<br>

          </font></div>

        <div><font size="2">------------------------------------------------------------------------------<br>

          </font></div>

        <div><br>

        </div>

        <div><font size="3">This happens only when both nodes are rebooted

            simultaneously. If the reboots are staggered, it does not

            occur. Looking at the logs, I see that when the nodes come

            up, the resources are started on both nodes; the cluster

            then tries to stop the started resources, and the stop

            fails.</font></div>

        <div><font size="3"><br>

          </font></div>

        <div><font size="3">I've attached the logs.</font><br>

        </div>

        <div><font size="2"><br>

          </font></div>

        <div><font size="2"><br>

          </font></div>

      </div>

      </div></div><pre><fieldset></fieldset>

_______________________________________________

Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org" target="_blank">Pacemaker@oss.clusterlabs.org</a>

<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a>

Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a>

Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a>

Bugs: <a href="http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker" target="_blank">http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker</a>

</pre>

    </blockquote><font color="#888888">

    <br>

    <br>

    <pre cols="72">-- 

    Alan Robertson <a href="mailto:alanr@unix.sh" target="_blank"><alanr@unix.sh></a>

"Openness is the foundation and preservative of friendship...  Let me claim from you at all times your undisguised opinions." - William Wilberforce

</pre>

  </font></div>


<br></blockquote></div><br>