[Pacemaker] Node doesn't rejoin automatically after reboot - POSSIBLE CAUSE

Bob Haxo bhaxo at sgi.com
Thu Jan 13 15:31:42 EST 2011


Hi Tom (and Andrew),

I figured out an easy fix for the problem that I encountered.  However,
there would seem to be a problem lurking in the code.

Here is what I found.  On one of the servers that was online and hosting
resources:

r2lead1:~ # netstat -a | grep crm
Proto RefCnt Flags       Type       State         I-Node Path
unix  2      [ ACC ]     STREAM     LISTENING     18659  /var/run/crm/st_command
unix  2      [ ACC ]     STREAM     LISTENING     18826  /var/run/crm/cib_rw
unix  2      [ ACC ]     STREAM     LISTENING     19373  /var/run/crm/crmd
unix  2      [ ACC ]     STREAM     LISTENING     18675  /var/run/crm/attrd
unix  2      [ ACC ]     STREAM     LISTENING     18694  /var/run/crm/pengine
unix  2      [ ACC ]     STREAM     LISTENING     18824  /var/run/crm/cib_callback
unix  2      [ ACC ]     STREAM     LISTENING     18825  /var/run/crm/cib_ro
unix  2      [ ACC ]     STREAM     LISTENING     18662  /var/run/crm/st_callback
unix  3      [ ]         STREAM     CONNECTED     20659  /var/run/crm/cib_callback
unix  3      [ ]         STREAM     CONNECTED     20656  /var/run/crm/cib_rw
unix  3      [ ]         STREAM     CONNECTED     19952  /var/run/crm/attrd
unix  3      [ ]         STREAM     CONNECTED     19944  /var/run/crm/st_callback
unix  3      [ ]         STREAM     CONNECTED     19941  /var/run/crm/st_command
unix  3      [ ]         STREAM     CONNECTED     19359  /var/run/crm/cib_callback
unix  3      [ ]         STREAM     CONNECTED     19356  /var/run/crm/cib_rw
unix  3      [ ]         STREAM     CONNECTED     19353  /var/run/crm/cib_callback
unix  3      [ ]         STREAM     CONNECTED     19350  /var/run/crm/cib_rw

On the node that was failing to join the HA cluster, this command
returned nothing.

However, on one of the functioning servers the same command returned
the stream information shown above, plus an additional ** 941 **
instances of the following line (each with a different I-Node number):

unix  3      [ ]         STREAM     CONNECTED     1238243 /var/run/crm/pengine
unix  3      [ ]         STREAM     CONNECTED     1237524 /var/run/crm/pengine
unix  3      [ ]         STREAM     CONNECTED     1236698 /var/run/crm/pengine
unix  3      [ ]         STREAM     CONNECTED     1235930 /var/run/crm/pengine
unix  3      [ ]         STREAM     CONNECTED     1235094 /var/run/crm/pengine
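
For anyone checking their own nodes, one quick way to count these leaked
pengine sockets would be something like the following (the exact grep
pattern is mine, not taken from the affected systems):

  netstat -a | grep -c 'CONNECTED.*/var/run/crm/pengine'

The healthy listing from r2lead1 above shows no pengine sockets in the
CONNECTED state, while the leaking server showed 941.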

Here is how I corrected the situation:

I ran "service openais stop" on the system with the 941 pengine
streams, then "service openais restart" on the server that was failing
to join the HA cluster.

Results:

The previously failing server joined the HA cluster and now accepts
migration of resources to it.

Running "service openais start" on the server that had had the 941
pengine streams brought that node back online as well.
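
To recap the recovery sequence (the host names below are placeholders,
not the actual machine names):

  leaky-node:~ #  service openais stop      # node holding the stale pengine streams
  failed-node:~ # service openais restart   # node that would not rejoin the cluster
  leaky-node:~ #  service openais start     # bring the first node back online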

Regards,
Bob Haxo

On Thu, 2011-01-13 at 11:15 -0800, Bob Haxo wrote:
> So, Tom ... how do you get the failed node online?
> 
> I've re-installed with the same image that is running on three other
> nodes, but the node still fails to join.  This node was quite happy for
> the past 3 months.  As I'm testing installs, this and other nodes have
> been installed a significant number of times without this sort of
> failure.  I'd whack the whole HA cluster ... except that I don't want
> to run into this failure again without a better solution than
> "reinstall the system" ;-)
> 
> I'm looking at the information returned with corosync debug enabled.
> After startup, everything looks fine to me until hitting this apparent
> local IPC delivery failure:
> 
> Jan 13 10:09:10 corosync [TOTEM ] Delivering 2 to 3
> Jan 13 10:09:10 corosync [TOTEM ] Delivering MCAST message with seq 3 to pending delivery queue
> Jan 13 10:09:10 corosync [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
> Jan 13 10:09:10 corosync [pcmk  ] Msg[6486] (dest=local:crmd, from=r1lead1:crmd.11229, remote=true, size=181): <create_request_adv origin="post_cache_update" t="crmd" version="3.0.2" subt="request" ref
> Jan 13 10:09:10 corosync [TOTEM ] mcasted message added to pending queue
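>
> For reference, the debug lines above come from running corosync with
> debug logging turned on.  A logging stanza along the following lines in
> /etc/corosync/corosync.conf should give similar output (this is a
> generic example, not copied from the affected systems):
>
>   logging {
>           to_syslog: yes
>           syslog_facility: daemon
>           debug: on
>           timestamp: on
>   }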
> 
> Guess that I'll have to renew my acquaintance with IPC.
> 
> Bob Haxo
> 
> 
> 
> On Thu, 2011-01-13 at 19:17 +0100, Tom Tux wrote:
> > I don't know. I still have this issue (and it seems that I'm not the
> > only one...). I'll have a look to see whether there are Pacemaker
> > updates available through the zypper update channel (SLES11 SP1).
> > 
> > Regards,
> > Tom
> > 
> > 
> > 2011/1/13 Bob Haxo <bhaxo at sgi.com>:
> > > Tom, others,
> > >
> > > Please, what was the solution to this issue?
> > >
> > > Thanks,
> > > Bob Haxo
> > >
> > > On Mon, 2010-09-06 at 09:50 +0200, Tom Tux wrote:
> > >
> > > Yes, corosync is running after the reboot. It comes up via the
> > > regular init procedure (runlevel 3 in my case).
> > >
> > > 2010/9/6 Andrew Beekhof <andrew at beekhof.net>:
> > >> On Mon, Sep 6, 2010 at 7:57 AM, Tom Tux <tomtux80 at gmail.com> wrote:
> > >>> No, I don't have such failure messages. In my case, the "Connection to
> > >>> our AIS plugin" was established.
> > >>>
> > >>> /dev/shm is also not full.
> > >>
> > >> Is corosync running?
> > >>
> > >>> Kind regards,
> > >>> Tom
> > >>>
> > >>> 2010/9/3 Michael Smith <msmith at cbnco.com>:
> > >>>> Tom Tux wrote:
> > >>>>
> > >>>>> If I take one cluster node (node01) out of the cluster for
> > >>>>> maintenance purposes (/etc/init.d/openais stop) and reboot it, the node
> > >>>>> will not rejoin the cluster automatically. After the reboot, I have the
> > >>>>> following error and warning messages in the log:
> > >>>>>
> > >>>>> Sep  3 07:34:15 node01 mgmtd: [9202]: info: login to cib failed: live
> > >>>>
> > >>>> Do you have messages like this, too?
> > >>>>
> > >>>> Aug 30 15:48:10 xen-test1 corosync[5851]:  [IPC   ] Invalid IPC
> > >>>> credentials.
> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: info: init_ais_connection:
> > >>>> Connection to our AIS plugin (9) failed: unknown (100)
> > >>>>
> > >>>> Aug 30 15:48:10 xen-test1 cib: [5858]: CRIT: cib_init: Cannot sign in to
> > >>>> the cluster... terminating
> > >>>>
> > >>>>
> > >>>>
> > >>>> http://news.gmane.org/find-root.php?message_id=%3c4C7C0EC7.2050708%40cbnco.com%3e
> > >>>>
> > >>>> Mike
> > >>>>
> > >>>
> > >>
> > >




