[Pacemaker] Cluster goes to (unmanaged) Failed state when both nodes are rebooted together

ihjaz Mohamed ihjazmohamed at yahoo.co.in
Mon Oct 24 10:23:16 EDT 2011

Hi All,

I have Pacemaker running with Corosync. The following is my CRM configuration.

node soalaba56
node soalaba63
primitive FloatingIP ocf:heartbeat:IPaddr2 \
        params ip="<floating_ip>" nic="eth0:0"
primitive acestatus lsb:acestatus
primitive pingd ocf:pacemaker:ping \
        params host_list="<gateway_ip>" multiplier="100" \
        op monitor interval="15s" timeout="5s"
group HAService FloatingIP acestatus \
        meta target-role="Started"
clone pingdclone pingd \
        meta globally-unique="false"
location ip1_location FloatingIP \
        rule $id="ip1_location-rule" pingd: defined pingd
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \

When I reboot both nodes together, the cluster goes into an (unmanaged) Failed state, as shown below.

Last updated: Mon Oct 24 08:10:42 2011
Stack: openais
Current DC: soalaba63 - partition with quorum
Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
2 Resources configured.

Online: [ soalaba56 soalaba63 ]

 Resource Group: HAService
     FloatingIP (ocf::heartbeat:IPaddr2) Started  (unmanaged) FAILED[   soalaba63       soalaba56 ]
     acestatus  (lsb:acestatus):        Stopped
 Clone Set: pingdclone [pingd]
     Started: [ soalaba56 soalaba63 ]

Failed actions:
    FloatingIP_stop_0 (node=soalaba63, call=7, rc=1, status=complete): unknown error
    FloatingIP_stop_0 (node=soalaba56, call=7, rc=1, status=complete): unknown error
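
For reference, rc=1 in the failed actions above is the generic OCF error code (OCF_ERR_GENERIC), which the status output prints as "unknown error". Below is a minimal sketch, assuming the line format shown above, that parses such entries and names the return code. The function name parse_failed_action and the table OCF_RC_NAMES are illustrative and not part of Pacemaker; the table is only a partial list of the standard OCF return codes.

```python
import re

# Partial, illustrative table of standard OCF return codes.
OCF_RC_NAMES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
}

# Matches lines such as:
#   FloatingIP_stop_0 (node=soalaba63, call=7, rc=1, status=complete): unknown error
FAILED_ACTION_RE = re.compile(
    r"^(?P<resource>\S+)_(?P<action>[a-z]+)_(?P<interval>\d+)\s+"
    r"\(node=(?P<node>[^,]+), call=(?P<call>\d+), rc=(?P<rc>\d+), "
    r"status=(?P<status>[^)]+)\):\s*(?P<message>.*)$"
)


def parse_failed_action(line):
    """Return a dict describing one failed-action line, or None if it does not match."""
    match = FAILED_ACTION_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("interval", "call", "rc"):
        entry[key] = int(entry[key])
    # Attach a symbolic name for the return code when it is in the table.
    entry["rc_name"] = OCF_RC_NAMES.get(entry["rc"], "unknown")
    return entry


if __name__ == "__main__":
    sample = ("    FloatingIP_stop_0 (node=soalaba63, call=7, rc=1, "
              "status=complete): unknown error")
    print(parse_failed_action(sample))
```

Both entries above parse to a stop action on FloatingIP with rc_name OCF_ERR_GENERIC, meaning the IPaddr2 agent itself reported a generic failure on stop rather than a timeout.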


This happens only when both nodes are rebooted simultaneously. If the reboots are done with some interval in between, the problem is not seen. Looking at the logs, I see that when the nodes come up, the resources are started on both nodes; the cluster then tries to stop the started resources, and the stop fails.

I've attached the logs.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: logs.txt
URL: <http://lists.clusterlabs.org/pipermail/pacemaker/attachments/20111024/25b5d27e/attachment-0002.txt>