[Pacemaker] Node doesn't rejoin automatically after reboot - POSSIBLE CAUSE

Andrew Beekhof andrew at beekhof.net
Fri Jan 14 12:50:44 UTC 2011


On Thu, Jan 13, 2011 at 9:31 PM, Bob Haxo <bhaxo at sgi.com> wrote:
> Hi Tom (and Andrew),
>
> I figured out an easy fix for the problem that I encountered.  However,
> there would seem to be a problem lurking in the code.

Were there (m)any logs containing the text "crm_abort" from the PE in
your history (on the bad node)?
That's the only way I can imagine so many copies of that file being open.
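For reference, a check like the following sketch would do it; the log path
and message format here are assumptions (SLES typically logs to
/var/log/messages), with a sample log standing in for the real one:

```shell
# Count crm_abort occurrences from the PE; a sample log stands in for
# the real syslog (path and format vary by distro, so this is illustrative).
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 13 10:09:10 r2lead1 pengine: [9202]: ERROR: crm_abort: ...
Jan 13 10:09:11 r2lead1 pengine: [9202]: info: process_pe_message: ...
Jan 13 10:09:12 r2lead1 pengine: [9202]: ERROR: crm_abort: ...
EOF
count=$(grep -c 'crm_abort' "$log")
echo "crm_abort occurrences: $count"
rm -f "$log"
```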

>
> Here is what I found.  On one of the servers that was online and hosting
> resources:
>
> r2lead1:~ # netstat -a | grep crm
> Proto RefCnt Flags       Type       State         I-Node Path
> unix  2      [ ACC ]     STREAM     LISTENING     18659  /var/run/crm/st_command
> unix  2      [ ACC ]     STREAM     LISTENING     18826  /var/run/crm/cib_rw
> unix  2      [ ACC ]     STREAM     LISTENING     19373  /var/run/crm/crmd
> unix  2      [ ACC ]     STREAM     LISTENING     18675  /var/run/crm/attrd
> unix  2      [ ACC ]     STREAM     LISTENING     18694  /var/run/crm/pengine
> unix  2      [ ACC ]     STREAM     LISTENING     18824  /var/run/crm/cib_callback
> unix  2      [ ACC ]     STREAM     LISTENING     18825  /var/run/crm/cib_ro
> unix  2      [ ACC ]     STREAM     LISTENING     18662  /var/run/crm/st_callback
> unix  3      [ ]         STREAM     CONNECTED     20659  /var/run/crm/cib_callback
> unix  3      [ ]         STREAM     CONNECTED     20656  /var/run/crm/cib_rw
> unix  3      [ ]         STREAM     CONNECTED     19952  /var/run/crm/attrd
> unix  3      [ ]         STREAM     CONNECTED     19944  /var/run/crm/st_callback
> unix  3      [ ]         STREAM     CONNECTED     19941  /var/run/crm/st_command
> unix  3      [ ]         STREAM     CONNECTED     19359  /var/run/crm/cib_callback
> unix  3      [ ]         STREAM     CONNECTED     19356  /var/run/crm/cib_rw
> unix  3      [ ]         STREAM     CONNECTED     19353  /var/run/crm/cib_callback
> unix  3      [ ]         STREAM     CONNECTED     19350  /var/run/crm/cib_rw
>
> On the node that was failing to join the HA cluster, this command
> returned nothing.
>
> However, on one of the functioning servers the above stream information
> was returned, but included an additional ** 941 ** instances of the
> following (with different I-Node numbers):
>
> unix  3      [ ]         STREAM     CONNECTED     1238243 /var/run/crm/pengine
> unix  3      [ ]         STREAM     CONNECTED     1237524 /var/run/crm/pengine
> unix  3      [ ]         STREAM     CONNECTED     1236698 /var/run/crm/pengine
> unix  3      [ ]         STREAM     CONNECTED     1235930 /var/run/crm/pengine
> unix  3      [ ]         STREAM     CONNECTED     1235094 /var/run/crm/pengine
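(As a sketch, leaked connections like those can be counted with a grep over
the netstat output; the sample data below stands in for a live node:)

```shell
# Count CONNECTED unix-stream entries for the pengine socket; on a live
# node you would pipe real "netstat -a" output instead of this sample.
sample='unix  3      [ ]         STREAM     CONNECTED     1238243 /var/run/crm/pengine
unix  3      [ ]         STREAM     CONNECTED     1237524 /var/run/crm/pengine
unix  2      [ ACC ]     STREAM     LISTENING     18694   /var/run/crm/pengine'
leaked=$(printf '%s\n' "$sample" | grep -c 'CONNECTED.*/var/run/crm/pengine')
echo "leaked pengine streams: $leaked"
```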
>
> Here is how I corrected the situation:
>
> service openais stop on the system with the 941 pengine streams; service
> openais restart on the server that was failing to join the HA cluster.
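(For clarity, that recovery sequence can be sketched as below; "openais" is
the SLES11 init-script name, and the commands are echoed through a stub
rather than executed, so the trace is safe to run anywhere:)

```shell
# The recovery sequence, traced with a stub instead of executed for real;
# drop the stub and run the commands as root on the respective nodes.
run() { echo "+ $*"; }
# 1. On the node holding the ~941 leaked pengine streams:
run service openais stop
# 2. On the node that was failing to join the HA cluster:
run service openais restart
# 3. Finally, bring the first node back online:
run service openais start
```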
>
> Results:
>
> The previously failing server joined the HA cluster and supports
> migration of resources to that server.
>
> service openais start on the server that had had the 941 pengine streams,
> and that node too came online.
>
> Regards,
> Bob Haxo
>
> On Thu, 2011-01-13 at 11:15 -0800, Bob Haxo wrote:
>> So, Tom ...how do you get the failed node online?
>>
>> I've re-installed with the same image that is running on three other
>> nodes, but still fails.  This node was quite happy for the past 3
>> months.  As I'm testing installs, this and other nodes have been
>> installed a significant number of times without this sort of failure.
>> I'd whack the whole HA cluster ... except that I don't want to run into
>> this failure again without a better solution than "reinstall the
>> system" ;-)
>>
>> I'm looking at the information returned with corosync debug enabled.
>> After startup, everything looks fine to me until hitting this apparent
>> local IPC delivery failure:
>>
>> Jan 13 10:09:10 corosync [TOTEM ] Delivering 2 to 3
>> Jan 13 10:09:10 corosync [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
>> Jan 13 10:09:10 corosync [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
>> Jan 13 10:09:10 corosync [pcmk  ] Msg[6486] (dest=local:crmd, from=r1lead1:crmd.11229, remote=true, size=181): <create_request_adv origin="post_cache_update" t="crmd" version="3.0.2" subt="request" ref
>> Jan 13 10:09:10 corosync [TOTEM ] mcasted message added to pending queue
>>
>> Guess that I'll have to renew my acquaintance with ipc.
>>
>> Bob Haxo
>>
>>
>>
>> On Thu, 2011-01-13 at 19:17 +0100, Tom Tux wrote:
>> > I don't know. I still have this issue (and it seems that I'm not the
>> > only one...). I'll check whether Pacemaker updates are available
>> > through the zypper update channel (SLES11 SP1).
>> >
>> > Regards,
>> > Tom
>> >
>> >
>> > 2011/1/13 Bob Haxo <bhaxo at sgi.com>:
>> > > Tom, others,
>> > >
>> > > Please, what was the solution to this issue?
>> > >
>> > > Thanks,
>> > > Bob Haxo
>> > >
>> > > On Mon, 2010-09-06 at 09:50 +0200, Tom Tux wrote:
>> > >
>> > > Yes, corosync is running after the reboot. It comes up with the
>> > > regular init-procedure (runlevel 3 in my case).
>> > >
>> > > 2010/9/6 Andrew Beekhof <andrew at beekhof.net>:
>> > >> On Mon, Sep 6, 2010 at 7:57 AM, Tom Tux <tomtux80 at gmail.com> wrote:
>> > >>> No, I don't have such failed-messages. In my case, the "Connection to
>> > >>> our AIS plugin" was established.
>> > >>>
>> > >>> The /dev/shm is also not full.
>> > >>
>> > >> Is corosync running?
>> > >>
>> > >>> Kind regards,
>> > >>> Tom
>> > >>>
>> > >>> 2010/9/3 Michael Smith <msmith at cbnco.com>:
>> > >>>> Tom Tux wrote:
>> > >>>>
>> > >>>>> If I remove one cluster node (node01) from the cluster for
>> > >>>>> maintenance purposes (/etc/init.d/openais stop) and reboot it, it
>> > >>>>> will not rejoin the cluster automatically. After the reboot, I see
>> > >>>>> the following error and warning messages in the log:
>> > >>>>>
>> > >>>>> Sep  3 07:34:15 node01 mgmtd: [9202]: info: login to cib failed: live
>> > >>>>
>> > >>>> Do you have messages like this, too?
>> > >>>>
>> > >>>> Aug 30 15:48:10 xen-test1 corosync[5851]:  [IPC   ] Invalid IPC
>> > >>>> credentials.
>> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: info: init_ais_connection:
>> > >>>> Connection to our AIS plugin (9) failed: unknown (100)
>> > >>>>
>> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: CRIT: cib_init: Cannot sign in to
>> > >>>> the cluster... terminating
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>> http://news.gmane.org/find-root.php?message_id=%3c4C7C0EC7.2050708%40cbnco.com%3e
>> > >>>>
>> > >>>> Mike
>> > >>>>
>> > >>>> _______________________________________________
>> > >>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> > >>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> > >>>>
>> > >>>> Project Home: http://www.clusterlabs.org
>> > >>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > >>>> Bugs:
>> > >>>>
>> > >>>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>> > >>>>
>> > >>>
>> > >>
>> > >
>
>
>



