[Pacemaker] Cluster goes to (unmanaged) Failed state when both nodes are rebooted together

ihjaz Mohamed ihjazmohamed at yahoo.co.in
Mon Oct 24 10:23:16 EDT 2011


Hi All,

I have Pacemaker running with Corosync. The following is my CRM configuration:

node soalaba56
node soalaba63
primitive FloatingIP ocf:heartbeat:IPaddr2 \
        params ip="<floating_ip>" nic="eth0:0"
primitive acestatus lsb:acestatus
primitive pingd ocf:pacemaker:ping \
        params host_list="<gateway_ip>" multiplier="100" \
        op monitor interval="15s" timeout="5s"
group HAService FloatingIP acestatus \
        meta target-role="Started"
clone pingdclone pingd \
        meta globally-unique="false"
location ip1_location FloatingIP \
        rule $id="ip1_location-rule" pingd: defined pingd
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1305736421"
----------------------------------------------------------------------
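
If it helps to reproduce, the configuration above can be checked against the live CIB with crm_verify (assuming the stock Pacemaker command-line tools; resource names as above):

    crm_verify -L -V    # validate the running CIB and report any errors verbosely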

When I reboot both nodes together, the cluster goes into an (unmanaged) Failed state, as shown below.


============
Last updated: Mon Oct 24 08:10:42 2011
Stack: openais
Current DC: soalaba63 - partition with quorum
Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ soalaba56 soalaba63 ]

 Resource Group: HAService
     FloatingIP (ocf::heartbeat:IPaddr2):   Started (unmanaged) FAILED [ soalaba63 soalaba56 ]
     acestatus  (lsb:acestatus):        Stopped
 Clone Set: pingdclone [pingd]
     Started: [ soalaba56 soalaba63 ]

Failed actions:
    FloatingIP_stop_0 (node=soalaba63, call=7, rc=1, status=complete): unknown error
    FloatingIP_stop_0 (node=soalaba56, call=7, rc=1, status=complete): unknown error

------------------------------------------------------------------------------
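
As far as I understand, once the stop failure itself is sorted out, the failed actions can be cleared and the resource put back under cluster management with the crm shell (a sketch, assuming crmsh is installed):

    crm resource cleanup FloatingIP    # clear the failed stop records
    crm resource manage FloatingIP     # return the resource to managed mode

But that does not explain why the stop fails in the first place.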


This happens only when both nodes are rebooted simultaneously; if the reboots are spaced some interval apart, the problem does not occur. Looking into the logs, I see that when the nodes come up the resources are started on both nodes, and the cluster then tries to stop the duplicate resources and fails at that point.
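
A quick way to confirm the dual start seems to be checking the interface on each node right after boot (assuming the floating IP sits on eth0 as configured above):

    ip addr show dev eth0    # run on both nodes; the floating IP appears on both when the problem occurs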

I've attached the logs.
Attachment: logs.txt
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20111024/25b5d27e/attachment-0002.txt>

