[Pacemaker] Removed nodes showing back in status

Larry Brigman larry.brigman at gmail.com
Fri May 25 12:59:14 EDT 2012


On Wed, May 16, 2012 at 1:53 PM, David Vossel <dvossel at redhat.com> wrote:
> ----- Original Message -----
>> From: "Larry Brigman" <larry.brigman at gmail.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Monday, May 14, 2012 4:59:55 PM
>> Subject: Re: [Pacemaker] Removed nodes showing back in status
>>
>> On Mon, May 14, 2012 at 2:13 PM, David Vossel <dvossel at redhat.com>
>> wrote:
>> > ----- Original Message -----
>> >> From: "Larry Brigman" <larry.brigman at gmail.com>
>> >> To: "The Pacemaker cluster resource manager"
>> >> <pacemaker at oss.clusterlabs.org>
>> >> Sent: Monday, May 14, 2012 1:30:22 PM
>> >> Subject: Re: [Pacemaker] Removed nodes showing back in status
>> >>
>> >> On Mon, May 14, 2012 at 9:54 AM, Larry Brigman
>> >> <larry.brigman at gmail.com> wrote:
>> >> > I have a 5-node cluster (but it could be any number of nodes, 3 or
>> >> > larger).
>> >> > I am testing some scripts for node removal.
>> >> > I remove a node from the cluster and everything looks correct from a
>> >> > crm status standpoint.
>> >> > When I remove a second node, the first node that was removed shows up
>> >> > again in crm status as offline.  I'm following the guidelines provided
>> >> > in the Pacemaker Explained docs:
>> >> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-node-delete.html
>> >> >
>> >> > I believe this is a bug but want to put it out to the list to be sure.
>> >> > Versions:
>> >> > RHEL5.7 x86_64
>> >> > corosync-1.4.2
>> >> > openais-1.1.3
>> >> > pacemaker-1.1.5
>> >> >
>> >> > Status after the first node was removed:
>> >> > [root@portland-3 ~]# crm status
>> >> > ============
>> >> > Last updated: Mon May 14 08:42:04 2012
>> >> > Stack: openais
>> >> > Current DC: portland-1 - partition with quorum
>> >> > Version: 1.1.5-1.3.sme-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>> >> > 4 Nodes configured, 4 expected votes
>> >> > 0 Resources configured.
>> >> > ============
>> >> >
>> >> > Online: [ portland-1 portland-2 portland-3 portland-4 ]
>> >> >
>> >> > Status after the second node was removed:
>> >> > [root@portland-3 ~]# crm status
>> >> > ============
>> >> > Last updated: Mon May 14 08:42:45 2012
>> >> > Stack: openais
>> >> > Current DC: portland-1 - partition with quorum
>> >> > Version: 1.1.5-1.3.sme-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>> >> > 4 Nodes configured, 3 expected votes
>> >> > 0 Resources configured.
>> >> > ============
>> >> >
>> >> > Online: [ portland-1 portland-3 portland-4 ]
>> >> > OFFLINE: [ portland-5 ]
>> >> >
>> >> > Both nodes were removed from the cluster from node 1.
>> >>
>> >> When I added a node back into the cluster, the second node
>> >> that was removed now shows as offline.
>> >
>> > The only time I've seen this sort of behavior is when I don't
>> > completely shut down corosync and pacemaker on the node I'm
>> > removing before I delete its configuration from the CIB.  Are you
>> > sure corosync and pacemaker are gone before you delete the node
>> > from the cluster config?
>>
>> Well, I run "service pacemaker stop" and "service corosync stop" prior to
>> doing the remove.  Since I am doing it all in a script, it's possible that
>> there is a race condition I have just exposed, or that the services are not
>> fully down when the service scripts exit.
>
> Yep, if you are waiting for the service scripts to return, I would expect it to be safe to remove the nodes at that point.
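
To be extra safe about that race, the removal script could also verify the
daemons are actually gone before deleting the node, rather than trusting the
init scripts' exit status alone.  A rough sketch (the node name is a
placeholder, and pacemakerd is the daemon name when pacemaker runs as its own
process):

    # On the node being removed: stop the stack, then wait until the
    # processes have really exited (the init scripts may return early).
    service pacemaker stop
    service corosync stop
    while pidof pacemakerd corosync >/dev/null 2>&1; do
        sleep 1
    done

    # Then, from a node that remains in the cluster, delete it from the
    # configuration.
    crm node delete portland-5
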
>
>> BTW, I'm running pacemaker as its own process instead of as a
>> child of corosync (if that makes a difference).
>>
>
> This shouldn't matter.
>
> An hb_report of this will help us determine whether this is a bug or not.
Bug opened with the hb_report and crm_report output attached:
https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2648
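
In case it helps whoever picks up the bug, the places I would expect the stale
node to linger are corosync's membership view and the CIB, and both are easy
to query (a sketch with standard tools; output not shown here):

    # corosync's view of current and former members
    corosync-objctl -a | grep members

    # node entries and node status Pacemaker still carries in the CIB;
    # a deleted node that keeps reappearing may still have an entry in
    # one of these sections
    cibadmin -Q -o nodes
    cibadmin -Q -o status

    # membership as Pacemaker currently sees it
    crm_node -p
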

>
> -- Vossel
>
>> [root@portland-3 ~]# cat /etc/corosync/service.d/pcmk
>> service {
>>         # Load the Pacemaker Cluster Resource Manager
>>         ver:  1
>>         name: pacemaker
>> #        use_mgmtd: yes
>> #        use_logd:  yes
>> }
>>
>> It looks like, from corosync's point of view, a removed node and a down
>> node have the same object state.
>> 4.0.0.2 has been removed; 4.0.0.5 is stopped.
>>
>> [root@portland-3 ~]# corosync-objctl -a | grep member
>> runtime.totem.pg.mrp.srp.members.16777220.ip=r(0) ip(4.0.0.1)
>> runtime.totem.pg.mrp.srp.members.16777220.join_count=1
>> runtime.totem.pg.mrp.srp.members.16777220.status=joined
>> runtime.totem.pg.mrp.srp.members.50331652.ip=r(0) ip(4.0.0.3)
>> runtime.totem.pg.mrp.srp.members.50331652.join_count=1
>> runtime.totem.pg.mrp.srp.members.50331652.status=joined
>> runtime.totem.pg.mrp.srp.members.67108868.ip=r(0) ip(4.0.0.4)
>> runtime.totem.pg.mrp.srp.members.67108868.join_count=3
>> runtime.totem.pg.mrp.srp.members.67108868.status=joined
>> runtime.totem.pg.mrp.srp.members.83886084.ip=r(0) ip(4.0.0.5)
>> runtime.totem.pg.mrp.srp.members.83886084.join_count=4
>> runtime.totem.pg.mrp.srp.members.83886084.status=joined
>> runtime.totem.pg.mrp.srp.members.33554436.ip=r(0) ip(4.0.0.2)
>> runtime.totem.pg.mrp.srp.members.33554436.join_count=1
>> runtime.totem.pg.mrp.srp.members.33554436.status=left
>>
>> >
>> > -- Vossel
>> >
>> >> [root@portland-3 ~]# crm status
>> >> ============
>> >> Last updated: Mon May 14 11:27:55 2012
>> >> Stack: openais
>> >> Current DC: portland-1 - partition with quorum
>> >> Version: 1.1.5-1.3.sme-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>> >> 5 Nodes configured, 4 expected votes
>> >> 0 Resources configured.
>> >> ============
>> >>
>> >> Online: [ portland-1 portland-3 portland-4 portland-5 ]
>> >> OFFLINE: [ portland-2 ]
>> >>
>> >
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



