[Pacemaker] Node fails to rejoin cluster

Andrew Beekhof andrew at beekhof.net
Thu Feb 7 00:58:37 EST 2013


On Wed, Feb 6, 2013 at 9:11 PM, Tal Yalon <yalon at infinidat.com> wrote:
> Hi all,
>
> I have a 2-node cluster where node-2 got fenced; after rebooting, it tries
> to rejoin the cluster but gets stuck in a loop for hours and never manages
> to rejoin.
>
> After another reboot it managed to join, and there was no time difference
> between the nodes.
>
> Below is the corosync/pacemaker log of node-2 (the one that was stuck in
> the loop).

Unfortunately we need the other one (the log from node-1, the surviving node).

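If it helps, here is a minimal sketch of one way to capture the matching window from both nodes with crm_report (assuming crm_report is available with this 1.1.7 build; the time window and output path below are placeholders, and the node names are taken from the log):

    # Gather logs, CIB and cluster state from both nodes for the period
    # around the failed rejoin attempt (adjust the window to the incident).
    crm_report --from "2013-02-06 01:30:00" --to "2013-02-06 02:00:00" \
               --nodes "node-1 node-2" /tmp/node2-rejoin

The resulting tarball should include the syslog and CIB from node-1 as well, which is what would show what node-1 was doing while node-2 sat in S_PENDING.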
> Any help would be appreciated, since I have no clue as to what happened.
>
> Thanks,
> Tal
>
>
> Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Corosync Cluster Engine
> ('1.4.1'): started and ready to provide service.
> Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Corosync built-in
> features: nss dbus rdma snmp
> Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Successfully read main
> configuration file '/etc/corosync/corosync.conf'.
> Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] Initializing transport
> (UDP/IP Unicast).
> Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] Initializing
> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] The network interface
> [9.151.142.20] is now up.
> Feb  6 01:39:32 node-2 corosync[27428]:   [pcmk  ] Logging: Initialized
> pcmk_startup
> Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
> Pacemaker Cluster Manager 1.1.6
> Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
> corosync extended virtual synchrony service
> Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
> corosync configuration service
> Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
> corosync cluster closed process group service v1.01
> Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
> corosync cluster config database access v1.01
> Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
> corosync profile loading service
> Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
> corosync cluster quorum service v0.1
> Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Compatibility mode set to
> whitetank.  Using V1 and V2 of the synchronization engine.
> Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Feb  6 01:39:32 node-2 corosync[27428]:   [CPG   ] chosen downlist: sender
> r(0) ip(9.151.142.20) ; members(old:0 left:0)
> Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: crm_log_init_worker:
> Changed active directory to /var/lib/heartbeat/cores/root
> Feb  6 01:39:37 node-2 pacemakerd[27466]:   notice: main: Starting Pacemaker
> 1.1.7-6.el6 (Build: 148fccfd5985c5590cc601123c6c16e966b85d14):
> generated-manpages agent-manpages ascii-docs publican-docs ncurses
> trace-logging libqb  corosync-plugin cman
> Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: main: Maximum core file
> size is: 18446744073709551615
> Feb  6 01:39:37 node-2 pacemakerd[27466]:   notice: update_node_processes:
> 0xb31fe0 Node 2 now known as node-2, was:
> Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
> child 27470 for process cib
> Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
> child 27471 for process stonith-ng
> Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
> child 27472 for process lrmd
> Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
> child 27473 for process attrd
> Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
> child 27474 for process pengine
> Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
> child 27475 for process crmd
> Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: main: Starting mainloop
> Feb  6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
> signal handler for signal 15
> Feb  6 01:39:37 node-2 stonith-ng[27471]:     info: crm_log_init_worker:
> Changed active directory to /var/lib/heartbeat/cores/root
> Feb  6 01:39:37 node-2 stonith-ng[27471]:     info: get_cluster_type:
> Cluster type is: 'openais'
> Feb  6 01:39:37 node-2 stonith-ng[27471]:   notice: crm_cluster_connect:
> Connecting to cluster infrastructure: classic openais (with plugin)
> Feb  6 01:39:37 node-2 stonith-ng[27471]:     info:
> init_ais_connection_classic: Creating connection to our Corosync plugin
> Feb  6 01:39:37 node-2 stonith-ng[27471]:     info:
> init_ais_connection_classic: AIS connection established
> Feb  6 01:39:37 node-2 stonith-ng[27471]:     info: get_ais_nodeid: Server
> details: id=2 uname=node-2 cname=pcmk
> Feb  6 01:39:37 node-2 stonith-ng[27471]:     info:
> init_ais_connection_once: Connection to 'classic openais (with plugin)':
> established
> Feb  6 01:39:37 node-2 stonith-ng[27471]:     info: crm_new_peer: Node
> node-2 now has id: 2
> Feb  6 01:39:37 node-2 stonith-ng[27471]:     info: crm_new_peer: Node 2 is
> now known as node-2
> Feb  6 01:39:37 node-2 crmd[27475]:     info: crm_log_init_worker: Changed
> active directory to /var/lib/heartbeat/cores/hacluster
> Feb  6 01:39:37 node-2 crmd[27475]:   notice: main: CRM Hg Version:
> 148fccfd5985c5590cc601123c6c16e966b85d14
> Feb  6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
> signal handler for signal 17
> Feb  6 01:39:37 node-2 lrmd: [27472]: info: enabling coredumps
> Feb  6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
> signal handler for signal 10
> Feb  6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
> signal handler for signal 12
> Feb  6 01:39:37 node-2 lrmd: [27472]: info: Started.
> Feb  6 01:39:37 node-2 cib[27470]:     info: crm_log_init_worker: Changed
> active directory to /var/lib/heartbeat/cores/hacluster
> Feb  6 01:39:37 node-2 cib[27470]:     info: retrieveCib: Reading cluster
> configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
> /var/lib/heartbeat/crm/cib.xml.sig)
> Feb  6 01:39:37 node-2 attrd[27473]:   notice: crm_cluster_connect:
> Connecting to cluster infrastructure: classic openais (with plugin)
> Feb  6 01:39:37 node-2 attrd[27473]:   notice: main: Starting mainloop...
> Feb  6 01:39:37 node-2 cib[27470]:     info: validate_with_relaxng: Creating
> RNG parser context
> Feb  6 01:39:37 node-2 cib[27470]:     info: startCib: CIB Initialization
> completed successfully
> Feb  6 01:39:37 node-2 cib[27470]:     info: get_cluster_type: Cluster type
> is: 'openais'
> Feb  6 01:39:37 node-2 cib[27470]:   notice: crm_cluster_connect: Connecting
> to cluster infrastructure: classic openais (with plugin)
> Feb  6 01:39:37 node-2 cib[27470]:     info: init_ais_connection_classic:
> Creating connection to our Corosync plugin
> Feb  6 01:39:37 node-2 cib[27470]:     info: init_ais_connection_classic:
> AIS connection established
> Feb  6 01:39:37 node-2 cib[27470]:     info: get_ais_nodeid: Server details:
> id=2 uname=node-2 cname=pcmk
> Feb  6 01:39:37 node-2 cib[27470]:     info: init_ais_connection_once:
> Connection to 'classic openais (with plugin)': established
> Feb  6 01:39:37 node-2 cib[27470]:     info: crm_new_peer: Node node-2 now
> has id: 2
> Feb  6 01:39:37 node-2 cib[27470]:     info: crm_new_peer: Node 2 is now
> known as node-2
> Feb  6 01:39:37 node-2 cib[27470]:     info: cib_init: Starting cib mainloop
> Feb  6 01:39:37 node-2 cib[27470]:     info: ais_dispatch_message:
> Membership 900: quorum still lost
> Feb  6 01:39:37 node-2 cib[27470]:     info: crm_update_peer: Node node-2:
> id=2 state=member (new) addr=r(0) ip(9.151.142.20)  (new) votes=1 (new)
> born=0 seen=900 proc=00000000000000000000000000000000
> Feb  6 01:39:37 node-2 cib[27470]:     info: ais_dispatch_message:
> Membership 904: quorum still lost
> Feb  6 01:39:37 node-2 cib[27470]:     info: crm_new_peer: Node <null> now
> has id: 1
> Feb  6 01:39:37 node-2 cib[27470]:     info: crm_update_peer: Node (null):
> id=1 state=member (new) addr=r(0) ip(9.151.142.10)  votes=0 born=0 seen=904
> proc=00000000000000000000000000000000
> Feb  6 01:39:37 node-2 corosync[27428]:   [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Feb  6 01:39:37 node-2 cib[27470]:   notice: ais_dispatch_message:
> Membership 904: quorum acquired
> Feb  6 01:39:37 node-2 cib[27470]:     info: crm_get_peer: Node 1 is now
> known as node-1
> Feb  6 01:39:37 node-2 cib[27470]:     info: crm_update_peer: Node node-1:
> id=1 state=member addr=r(0) ip(9.151.142.10)  votes=1 (new) born=896
> seen=904 proc=00000000000000000000000000000000
> Feb  6 01:39:37 node-2 corosync[27428]:   [CPG   ] chosen downlist: sender
> r(0) ip(9.151.142.10) ; members(old:1 left:0)
> Feb  6 01:39:37 node-2 corosync[27428]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
> Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
> cib_apply_diff message (30) from node-1: not in our membership
> Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
> cib_apply_diff message (31) from node-1: not in our membership
> Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
> cib_apply_diff message (32) from node-1: not in our membership
> Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
> cib_apply_diff message (33) from node-1: not in our membership
> Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
> cib_apply_diff message (34) from node-1: not in our membership
> Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
> cib_apply_diff message (35) from node-1: not in our membership
> Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
> cib_apply_diff message (36) from node-1: not in our membership
> Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
> cib_apply_diff message (37) from node-1: not in our membership
> Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
> cib_apply_diff message (38) from node-1: not in our membership
> Feb  6 01:39:38 node-2 stonith-ng[27471]:   notice: setup_cib: Watching for
> stonith topology changes
> Feb  6 01:39:38 node-2 stonith-ng[27471]:     info: main: Starting
> stonith-ng mainloop
> Feb  6 01:39:38 node-2 crmd[27475]:     info: do_cib_control: CIB connection
> established
> Feb  6 01:39:38 node-2 crmd[27475]:     info: get_cluster_type: Cluster type
> is: 'openais'
> Feb  6 01:39:38 node-2 crmd[27475]:   notice: crm_cluster_connect:
> Connecting to cluster infrastructure: classic openais (with plugin)
> Feb  6 01:39:38 node-2 crmd[27475]:     info: init_ais_connection_classic:
> Creating connection to our Corosync plugin
> Feb  6 01:39:38 node-2 crmd[27475]:     info: init_ais_connection_classic:
> AIS connection established
> Feb  6 01:39:38 node-2 crmd[27475]:     info: get_ais_nodeid: Server
> details: id=2 uname=node-2 cname=pcmk
> Feb  6 01:39:38 node-2 crmd[27475]:     info: init_ais_connection_once:
> Connection to 'classic openais (with plugin)': established
> Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_new_peer: Node node-2 now
> has id: 2
> Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_new_peer: Node 2 is now
> known as node-2
> Feb  6 01:39:38 node-2 crmd[27475]:     info: ais_status_callback: status:
> node-2 is now unknown
> Feb  6 01:39:38 node-2 crmd[27475]:     info: do_ha_control: Connected to
> the cluster
> Feb  6 01:39:38 node-2 crmd[27475]:     info: do_started: Delaying start, no
> membership data (0000000000100000)
> Feb  6 01:39:38 node-2 crmd[27475]:   notice: ais_dispatch_message:
> Membership 904: quorum acquired
> Feb  6 01:39:38 node-2 crmd[27475]:     info: ais_status_callback: status:
> node-2 is now member (was unknown)
> Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_update_peer: Node node-2:
> id=2 state=member (new) addr=r(0) ip(9.151.142.20)  (new) votes=1 (new)
> born=904 seen=904 proc=00000000000000000000000000000000
> Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_new_peer: Node node-1 now
> has id: 1
> Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_new_peer: Node 1 is now
> known as node-1
> Feb  6 01:39:38 node-2 crmd[27475]:     info: ais_status_callback: status:
> node-1 is now unknown
> Feb  6 01:39:38 node-2 crmd[27475]:     info: ais_status_callback: status:
> node-1 is now member (was unknown)
> Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_update_peer: Node node-1:
> id=1 state=member (new) addr=r(0) ip(9.151.142.10)  votes=1 born=896
> seen=904 proc=00000000000000000000000000000000
> Feb  6 01:39:38 node-2 crmd[27475]:     info: ais_dispatch_message:
> Membership 904: quorum retained
> Feb  6 01:39:38 node-2 crmd[27475]:   notice: do_started: The local CRM is
> operational
> Feb  6 01:39:38 node-2 crmd[27475]:   notice: do_state_transition: State
> transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
> origin=do_started ]
> Feb  6 01:39:38 node-2 pacemakerd[27466]:   notice: update_node_processes:
> 0xb3cdd0 Node 1 now known as node-1, was:
> Feb  6 01:39:38 node-2 stonith-ng[27471]:     info: crm_new_peer: Node
> node-1 now has id: 1
> Feb  6 01:39:38 node-2 stonith-ng[27471]:     info: crm_new_peer: Node 1 is
> now known as node-1
> Feb  6 01:39:39 node-2 crmd[27475]:   notice: crmd_peer_update: Status
> update: Client node-2/crmd now has status [online] (DC=<null>)
> Feb  6 01:39:39 node-2 crmd[27475]:   notice: crmd_peer_update: Status
> update: Client node-1/crmd now has status [online] (DC=<null>)
> Feb  6 01:39:39 node-2 cib[27470]:     info: cib_replace_notify: Replaced:
> -1.-1.-1 -> 0.108.191 from node-1
> Feb  6 01:39:59 node-2 crmd[27475]:     info: crm_timer_popped: Election
> Trigger (I_DC_TIMEOUT) just popped (20000ms)
> Feb  6 01:39:59 node-2 crmd[27475]:  warning: do_log: FSA: Input
> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Feb  6 01:39:59 node-2 crmd[27475]:   notice: do_state_transition: State
> transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> Feb  6 01:39:59 node-2 crmd[27475]:     info: do_election_count_vote:
> Election 3 (owner: node-1) lost: vote from node-1 (Uptime)
> Feb  6 01:39:59 node-2 crmd[27475]:   notice: do_state_transition: State
> transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
> origin=do_election_count_vote ]
> Feb  6 01:39:59 node-2 crmd[27475]:     info: do_dc_release: DC role
> released
> Feb  6 01:39:59 node-2 crmd[27475]:     info: do_te_control: Transitioner is
> now inactive
> Feb  6 01:40:19 node-2 crmd[27475]:     info: crm_timer_popped: Election
> Trigger (I_DC_TIMEOUT) just popped (20000ms)
> Feb  6 01:40:19 node-2 crmd[27475]:  warning: do_log: FSA: Input
> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Feb  6 01:40:19 node-2 crmd[27475]:   notice: do_state_transition: State
> transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> Feb  6 01:40:19 node-2 crmd[27475]:     info: do_election_count_vote:
> Election 4 (owner: node-1) lost: vote from node-1 (Uptime)
> Feb  6 01:40:19 node-2 crmd[27475]:   notice: do_state_transition: State
> transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
> origin=do_election_count_vote ]
> Feb  6 01:40:19 node-2 crmd[27475]:     info: do_dc_release: DC role
> released
> Feb  6 01:40:19 node-2 crmd[27475]:     info: do_te_control: Transitioner is
> now inactive
> Feb  6 01:40:39 node-2 crmd[27475]:     info: crm_timer_popped: Election
> Trigger (I_DC_TIMEOUT) just popped (20000ms)
> Feb  6 01:40:39 node-2 crmd[27475]:  warning: do_log: FSA: Input
> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Feb  6 01:40:39 node-2 crmd[27475]:   notice: do_state_transition: State
> transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> Feb  6 01:40:39 node-2 crmd[27475]:     info: do_election_count_vote:
> Election 5 (owner: node-1) lost: vote from node-1 (Uptime)
> Feb  6 01:40:39 node-2 crmd[27475]:   notice: do_state_transition: State
> transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
> origin=do_election_count_vote ]
> Feb  6 01:40:39 node-2 crmd[27475]:     info: do_dc_release: DC role
> released
> Feb  6 01:40:39 node-2 crmd[27475]:     info: do_te_control: Transitioner is
> now inactive
> Feb  6 01:40:59 node-2 crmd[27475]:     info: crm_timer_popped: Election
> Trigger (I_DC_TIMEOUT) just popped (20000ms)
> Feb  6 01:40:59 node-2 crmd[27475]:  warning: do_log: FSA: Input
> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Feb  6 01:40:59 node-2 crmd[27475]:   notice: do_state_transition: State
> transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> Feb  6 01:40:59 node-2 crmd[27475]:     info: do_election_count_vote:
> Election 6 (owner: node-1) lost: vote from node-1 (Uptime)
> Feb  6 01:40:59 node-2 crmd[27475]:   notice: do_state_transition: State
> transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
> origin=do_election_count_vote ]
> Feb  6 01:40:59 node-2 crmd[27475]:     info: do_dc_release: DC role
> released
> Feb  6 01:40:59 node-2 crmd[27475]:     info: do_te_control: Transitioner is
> now inactive
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



