[Pacemaker] Cluster goes to (unmanaged) Failed state when both nodes are rebooted together

ihjaz Mohamed ihjazmohamed at yahoo.co.in
Tue Oct 25 10:22:16 EDT 2011


Hi,

I've tried moving the corosync startup from S20 to S98, but the issue is still there.

Maybe I'll have to remove it from init and write an upstart job for corosync.
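A minimal upstart job along those lines might look like the following sketch. The `static-network-up` event and the corosync binary path are assumptions that would need checking against the actual distribution:

```
# /etc/init/corosync.conf -- hypothetical upstart job
description "corosync cluster engine"

# Do not start until networking is up; 'static-network-up' is the
# event Ubuntu's ifup scripts emit and may differ on other systems.
start on (started networking and static-network-up)
stop on runlevel [016]

respawn

# -f keeps corosync in the foreground so upstart can supervise it
exec /usr/sbin/corosync -f
```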



________________________________
From: Andreas Kurz <andreas at hastexo.com>
To: pacemaker at oss.clusterlabs.org
Sent: Tuesday, 25 October 2011 6:50 PM
Subject: Re: [Pacemaker] Cluster goes to (unmanaged) Failed state when both nodes are rebooted together

hello,

On 10/25/2011 09:17 AM, ihjaz Mohamed wrote:
> If I start corosync on both servers at the same time, it comes up fine.
> So I am just wondering how this differs from corosync being started
> automatically by the server during boot.

Maybe corosync is started too early during system boot, before network
connectivity is fully established.
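One way to guard against that is to poll for connectivity in the init script before corosync is launched. A generic sketch (hypothetical; the default-route check, the `start_corosync` name, and the 60-second timeout are all assumptions):

```shell
#!/bin/sh
# wait_for CMD TIMEOUT: run CMD once per second until it succeeds
# or TIMEOUT seconds elapse; returns 0 on success, 1 on timeout.
wait_for() {
    cmd=$1
    timeout=${2:-60}
    i=0
    while [ "$i" -lt "$timeout" ]; do
        if sh -c "$cmd" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# In the init script, before the daemon is started (illustrative only):
#   wait_for "ip route | grep -q '^default'" 60 && start_corosync
```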

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> 
> 
> ------------------------------------------------------------------------
> *From:* Andreas Kurz <andreas at hastexo.com>
> *To:* pacemaker at oss.clusterlabs.org
> *Sent:* Monday, 24 October 2011 9:30 PM
> *Subject:* Re: [Pacemaker] Cluster goes to (unmanaged) Failed state when
> both nodes are rebooted together
> 
> hello,
> 
> On 10/24/2011 05:21 PM, ihjaz Mohamed wrote:
>> Its part of the requirement given to me to support this solution on
>> servers without stonith devices. So I cannot enable the stonith.
> 
> Too bad, then you have to live with some limitations of this setup. You
> could add a random wait before corosync starts ... or simply: don't
> reboot them at the same time ;-)
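A random pre-start delay of the kind suggested above could be sketched as follows (hypothetical init-script addition; the 10-second upper bound is arbitrary):

```shell
#!/bin/sh
# Sleep a random 0-9 seconds so that two nodes rebooted together do
# not start corosync at exactly the same moment. od reads one byte
# (0-255) from /dev/urandom; the modulus caps the delay.
delay=$(( $(od -An -N1 -tu1 /dev/urandom) % 10 ))
echo "delaying corosync start by ${delay}s"
sleep "$delay"
# ... the normal corosync startup would follow here ...
```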
> 
> But it would also be interesting to know why FloatingIP_stop_0 returns
> an error on both nodes ... the logs should tell you what happened.
> 
> .... and remove nic="eth0:0"; you must not define an interface alias
> here, only the NIC itself.
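For reference, the primitive without the alias would then look like this (a sketch derived from the configuration quoted further down; only the nic parameter changes, and the ip placeholder is left as in the original):

```
primitive FloatingIP ocf:heartbeat:IPaddr2 \
        params ip="<floating_ip>" nic="eth0"
```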
> 
> Regards,
> Andreas
> 
> -- 
> Need help with Pacemaker?
> http://www.hastexo.com/now
> 
> 
>>
>> ------------------------------------------------------------------------
>> *From:* Alan Robertson <alanr at unix.sh <mailto:alanr at unix.sh>>
>> *To:* ihjaz Mohamed <ihjazmohamed at yahoo.co.in
> <mailto:ihjazmohamed at yahoo.co.in>>; The Pacemaker cluster
>> resource manager <pacemaker at oss.clusterlabs.org
> <mailto:pacemaker at oss.clusterlabs.org>>
>> *Sent:* Monday, 24 October 2011 8:22 PM
>> *Subject:* Re: [Pacemaker] Cluster goes to (unmanaged) Failed state when
>> both nodes are rebooted together
>>
>> Setting no-quorum-policy to ignore and disabling stonith is not a good
>> idea.  You're sort of inviting the cluster to do screwed up things.
>>
>>
>> On 10/24/2011 08:23 AM, ihjaz Mohamed wrote:
>>> Hi All,
>>>
>>> I 've pacemaker running with corosync. Following is my CRM configuration.
>>>
>>> node soalaba56
>>> node soalaba63
>>> primitive FloatingIP ocf:heartbeat:IPaddr2 \
>>>        params ip="<floating_ip>" nic="eth0:0"
>>> primitive acestatus lsb:acestatus
>>> primitive pingd ocf:pacemaker:ping \
>>>        params host_list="<gateway_ip>" multiplier="100" \
>>>        op monitor interval="15s" timeout="5s"
>>> group HAService FloatingIP acestatus \
>>>        meta target-role="Started"
>>> clone pingdclone pingd \
>>>        meta globally-unique="false"
>>> location ip1_location FloatingIP \
>>>        rule $id="ip1_location-rule" pingd: defined pingd
>>> property $id="cib-bootstrap-options" \
>>>        dc-version="1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
>>>        cluster-infrastructure="openais" \
>>>        expected-quorum-votes="2" \
>>>        stonith-enabled="false" \
>>>        no-quorum-policy="ignore" \
>>>        last-lrm-refresh="1305736421"
>>> ----------------------------------------------------------------------
>>>
>>> When I reboot both the nodes together, cluster goes into an
>>> (unmanaged) Failed state as shown below.
>>>
>>>
>>> ============
>>> Last updated: Mon Oct 24 08:10:42 2011
>>> Stack: openais
>>> Current DC: soalaba63 - partition with quorum
>>> Version: 1.1.5-5.el6-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>>> 2 Nodes configured, 2 expected votes
>>> 2 Resources configured.
>>> ============
>>>
>>> Online: [ soalaba56 soalaba63 ]
>>>
>>>  Resource Group: HAService
>>>      FloatingIP (ocf::heartbeat:IPaddr2) Started (unmanaged)
>>>          FAILED [ soalaba63 soalaba56 ]
>>>      acestatus  (lsb:acestatus):        Stopped
>>>  Clone Set: pingdclone [pingd]
>>>      Started: [ soalaba56 soalaba63 ]
>>>
>>> Failed actions:
>>>    FloatingIP_stop_0 (node=soalaba63, call=7, rc=1, status=complete):
>>> unknown error
>>>    FloatingIP_stop_0 (node=soalaba56, call=7, rc=1, status=complete):
>>> unknown error
>>>
> ------------------------------------------------------------------------------
>>>
>>> This happens only when both nodes are rebooted simultaneously; if the
>>> reboots are staggered, the problem does not occur. Looking into the
>>> logs, I see that when the nodes come up, the resources are started on
>>> both nodes; the cluster then tries to stop the duplicate resources
>>> and fails there.
>>>
>>> I've attached the logs.
>>>
>>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
>>
>> --
>>    Alan Robertson <alanr at unix.sh <mailto:alanr at unix.sh>>
> <mailto:alanr at unix.sh <mailto:alanr at unix.sh>>
>>
>> "Openness is the foundation and preservative of friendship...  Let me
> claim from you at all times your undisguised opinions." - William
> Wilberforce
>>
>>
>>
>>
>>
> 
> 
> 
> 
> 
> 
> 






