[Pacemaker] Node fails to rejoin cluster

Tal Yalon yalon at infinidat.com
Wed Feb 6 05:11:10 EST 2013


Hi all,

I have a 2-node cluster in which node-2 was fenced. After rebooting, node-2
tries to rejoin the cluster but fails: it gets stuck in a loop for hours and
never rejoins.

After yet another reboot it managed to join, and there was no time difference
between the nodes' clocks.

Below is the corosync/pacemaker log of node-2 (the node that was stuck in
the loop). Any help would be appreciated, as I have no clue what happened.

Thanks,
Tal


Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Corosync Cluster Engine
('1.4.1'): started and ready to provide service.
Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Corosync built-in
features: nss dbus rdma snmp
Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Successfully read main
configuration file '/etc/corosync/corosync.conf'.
Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] Initializing transport
(UDP/IP Unicast).
Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] Initializing
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] The network interface
[9.151.142.20] is now up.
Feb  6 01:39:32 node-2 corosync[27428]:   [pcmk  ] Logging: Initialized
pcmk_startup
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
Pacemaker Cluster Manager 1.1.6
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
corosync extended virtual synchrony service
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
corosync configuration service
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
corosync cluster closed process group service v1.01
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
corosync cluster config database access v1.01
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
corosync profile loading service
Feb  6 01:39:32 node-2 corosync[27428]:   [SERV  ] Service engine loaded:
corosync cluster quorum service v0.1
Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Compatibility mode set
to whitetank.  Using V1 and V2 of the synchronization engine.
Feb  6 01:39:32 node-2 corosync[27428]:   [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Feb  6 01:39:32 node-2 corosync[27428]:   [CPG   ] chosen downlist: sender
r(0) ip(9.151.142.20) ; members(old:0 left:0)
Feb  6 01:39:32 node-2 corosync[27428]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: crm_log_init_worker:
Changed active directory to /var/lib/heartbeat/cores/root
Feb  6 01:39:37 node-2 pacemakerd[27466]:   notice: main: Starting
Pacemaker 1.1.7-6.el6 (Build: 148fccfd5985c5590cc601123c6c16e966b85d14):
 generated-manpages agent-manpages ascii-docs publican-docs ncurses
trace-logging libqb  corosync-plugin cman
Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: main: Maximum core file
size is: 18446744073709551615
Feb  6 01:39:37 node-2 pacemakerd[27466]:   notice: update_node_processes:
0xb31fe0 Node 2 now known as node-2, was:
Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
child 27470 for process cib
Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
child 27471 for process stonith-ng
Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
child 27472 for process lrmd
Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
child 27473 for process attrd
Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
child 27474 for process pengine
Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: start_child: Forked
child 27475 for process crmd
Feb  6 01:39:37 node-2 pacemakerd[27466]:     info: main: Starting mainloop
Feb  6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
signal handler for signal 15
Feb  6 01:39:37 node-2 stonith-ng[27471]:     info: crm_log_init_worker:
Changed active directory to /var/lib/heartbeat/cores/root
Feb  6 01:39:37 node-2 stonith-ng[27471]:     info: get_cluster_type:
Cluster type is: 'openais'
Feb  6 01:39:37 node-2 stonith-ng[27471]:   notice: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Feb  6 01:39:37 node-2 stonith-ng[27471]:     info:
init_ais_connection_classic: Creating connection to our Corosync plugin
Feb  6 01:39:37 node-2 stonith-ng[27471]:     info:
init_ais_connection_classic: AIS connection established
Feb  6 01:39:37 node-2 stonith-ng[27471]:     info: get_ais_nodeid: Server
details: id=2 uname=node-2 cname=pcmk
Feb  6 01:39:37 node-2 stonith-ng[27471]:     info:
init_ais_connection_once: Connection to 'classic openais (with plugin)':
established
Feb  6 01:39:37 node-2 stonith-ng[27471]:     info: crm_new_peer: Node
node-2 now has id: 2
Feb  6 01:39:37 node-2 stonith-ng[27471]:     info: crm_new_peer: Node 2 is
now known as node-2
Feb  6 01:39:37 node-2 crmd[27475]:     info: crm_log_init_worker: Changed
active directory to /var/lib/heartbeat/cores/hacluster
Feb  6 01:39:37 node-2 crmd[27475]:   notice: main: CRM Hg Version:
148fccfd5985c5590cc601123c6c16e966b85d14
Feb  6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
signal handler for signal 17
Feb  6 01:39:37 node-2 lrmd: [27472]: info: enabling coredumps
Feb  6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
signal handler for signal 10
Feb  6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added
signal handler for signal 12
Feb  6 01:39:37 node-2 lrmd: [27472]: info: Started.
Feb  6 01:39:37 node-2 cib[27470]:     info: crm_log_init_worker: Changed
active directory to /var/lib/heartbeat/cores/hacluster
Feb  6 01:39:37 node-2 cib[27470]:     info: retrieveCib: Reading cluster
configuration from: /var/lib/heartbeat/crm/cib.xml (digest:
/var/lib/heartbeat/crm/cib.xml.sig)
Feb  6 01:39:37 node-2 attrd[27473]:   notice: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Feb  6 01:39:37 node-2 attrd[27473]:   notice: main: Starting mainloop...
Feb  6 01:39:37 node-2 cib[27470]:     info: validate_with_relaxng:
Creating RNG parser context
Feb  6 01:39:37 node-2 cib[27470]:     info: startCib: CIB Initialization
completed successfully
Feb  6 01:39:37 node-2 cib[27470]:     info: get_cluster_type: Cluster type
is: 'openais'
Feb  6 01:39:37 node-2 cib[27470]:   notice: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Feb  6 01:39:37 node-2 cib[27470]:     info: init_ais_connection_classic:
Creating connection to our Corosync plugin
Feb  6 01:39:37 node-2 cib[27470]:     info: init_ais_connection_classic:
AIS connection established
Feb  6 01:39:37 node-2 cib[27470]:     info: get_ais_nodeid: Server
details: id=2 uname=node-2 cname=pcmk
Feb  6 01:39:37 node-2 cib[27470]:     info: init_ais_connection_once:
Connection to 'classic openais (with plugin)': established
Feb  6 01:39:37 node-2 cib[27470]:     info: crm_new_peer: Node node-2 now
has id: 2
Feb  6 01:39:37 node-2 cib[27470]:     info: crm_new_peer: Node 2 is now
known as node-2
Feb  6 01:39:37 node-2 cib[27470]:     info: cib_init: Starting cib mainloop
Feb  6 01:39:37 node-2 cib[27470]:     info: ais_dispatch_message:
Membership 900: quorum still lost
Feb  6 01:39:37 node-2 cib[27470]:     info: crm_update_peer: Node node-2:
id=2 state=member (new) addr=r(0) ip(9.151.142.20)  (new) votes=1 (new)
born=0 seen=900 proc=00000000000000000000000000000000
Feb  6 01:39:37 node-2 cib[27470]:     info: ais_dispatch_message:
Membership 904: quorum still lost
Feb  6 01:39:37 node-2 cib[27470]:     info: crm_new_peer: Node <null> now
has id: 1
Feb  6 01:39:37 node-2 cib[27470]:     info: crm_update_peer: Node (null):
id=1 state=member (new) addr=r(0) ip(9.151.142.10)  votes=0 born=0 seen=904
proc=00000000000000000000000000000000
Feb  6 01:39:37 node-2 corosync[27428]:   [TOTEM ] A processor joined or
left the membership and a new membership was formed.
Feb  6 01:39:37 node-2 cib[27470]:   notice: ais_dispatch_message:
Membership 904: quorum acquired
Feb  6 01:39:37 node-2 cib[27470]:     info: crm_get_peer: Node 1 is now
known as node-1
Feb  6 01:39:37 node-2 cib[27470]:     info: crm_update_peer: Node node-1:
id=1 state=member addr=r(0) ip(9.151.142.10)  votes=1 (new) born=896
seen=904 proc=00000000000000000000000000000000
Feb  6 01:39:37 node-2 corosync[27428]:   [CPG   ] chosen downlist: sender
r(0) ip(9.151.142.10) ; members(old:1 left:0)
Feb  6 01:39:37 node-2 corosync[27428]:   [MAIN  ] Completed service
synchronization, ready to provide service.
Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
cib_apply_diff message (30) from node-1: not in our membership
Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
cib_apply_diff message (31) from node-1: not in our membership
Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
cib_apply_diff message (32) from node-1: not in our membership
Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
cib_apply_diff message (33) from node-1: not in our membership
Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
cib_apply_diff message (34) from node-1: not in our membership
Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
cib_apply_diff message (35) from node-1: not in our membership
Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
cib_apply_diff message (36) from node-1: not in our membership
Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
cib_apply_diff message (37) from node-1: not in our membership
Feb  6 01:39:38 node-2 cib[27470]:  warning: cib_peer_callback: Discarding
cib_apply_diff message (38) from node-1: not in our membership
Feb  6 01:39:38 node-2 stonith-ng[27471]:   notice: setup_cib: Watching for
stonith topology changes
Feb  6 01:39:38 node-2 stonith-ng[27471]:     info: main: Starting
stonith-ng mainloop
Feb  6 01:39:38 node-2 crmd[27475]:     info: do_cib_control: CIB
connection established
Feb  6 01:39:38 node-2 crmd[27475]:     info: get_cluster_type: Cluster
type is: 'openais'
Feb  6 01:39:38 node-2 crmd[27475]:   notice: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Feb  6 01:39:38 node-2 crmd[27475]:     info: init_ais_connection_classic:
Creating connection to our Corosync plugin
Feb  6 01:39:38 node-2 crmd[27475]:     info: init_ais_connection_classic:
AIS connection established
Feb  6 01:39:38 node-2 crmd[27475]:     info: get_ais_nodeid: Server
details: id=2 uname=node-2 cname=pcmk
Feb  6 01:39:38 node-2 crmd[27475]:     info: init_ais_connection_once:
Connection to 'classic openais (with plugin)': established
Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_new_peer: Node node-2 now
has id: 2
Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_new_peer: Node 2 is now
known as node-2
Feb  6 01:39:38 node-2 crmd[27475]:     info: ais_status_callback: status:
node-2 is now unknown
Feb  6 01:39:38 node-2 crmd[27475]:     info: do_ha_control: Connected to
the cluster
Feb  6 01:39:38 node-2 crmd[27475]:     info: do_started: Delaying start,
no membership data (0000000000100000)
Feb  6 01:39:38 node-2 crmd[27475]:   notice: ais_dispatch_message:
Membership 904: quorum acquired
Feb  6 01:39:38 node-2 crmd[27475]:     info: ais_status_callback: status:
node-2 is now member (was unknown)
Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_update_peer: Node node-2:
id=2 state=member (new) addr=r(0) ip(9.151.142.20)  (new) votes=1 (new)
born=904 seen=904 proc=00000000000000000000000000000000
Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_new_peer: Node node-1 now
has id: 1
Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_new_peer: Node 1 is now
known as node-1
Feb  6 01:39:38 node-2 crmd[27475]:     info: ais_status_callback: status:
node-1 is now unknown
Feb  6 01:39:38 node-2 crmd[27475]:     info: ais_status_callback: status:
node-1 is now member (was unknown)
Feb  6 01:39:38 node-2 crmd[27475]:     info: crm_update_peer: Node node-1:
id=1 state=member (new) addr=r(0) ip(9.151.142.10)  votes=1 born=896
seen=904 proc=00000000000000000000000000000000
Feb  6 01:39:38 node-2 crmd[27475]:     info: ais_dispatch_message:
Membership 904: quorum retained
Feb  6 01:39:38 node-2 crmd[27475]:   notice: do_started: The local CRM is
operational
Feb  6 01:39:38 node-2 crmd[27475]:   notice: do_state_transition: State
transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_started ]
Feb  6 01:39:38 node-2 pacemakerd[27466]:   notice: update_node_processes:
0xb3cdd0 Node 1 now known as node-1, was:
Feb  6 01:39:38 node-2 stonith-ng[27471]:     info: crm_new_peer: Node
node-1 now has id: 1
Feb  6 01:39:38 node-2 stonith-ng[27471]:     info: crm_new_peer: Node 1 is
now known as node-1
Feb  6 01:39:39 node-2 crmd[27475]:   notice: crmd_peer_update: Status
update: Client node-2/crmd now has status [online] (DC=<null>)
Feb  6 01:39:39 node-2 crmd[27475]:   notice: crmd_peer_update: Status
update: Client node-1/crmd now has status [online] (DC=<null>)
Feb  6 01:39:39 node-2 cib[27470]:     info: cib_replace_notify: Replaced:
-1.-1.-1 -> 0.108.191 from node-1
Feb  6 01:39:59 node-2 crmd[27475]:     info: crm_timer_popped: Election
Trigger (I_DC_TIMEOUT) just popped (20000ms)
Feb  6 01:39:59 node-2 crmd[27475]:  warning: do_log: FSA: Input
I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb  6 01:39:59 node-2 crmd[27475]:   notice: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb  6 01:39:59 node-2 crmd[27475]:     info: do_election_count_vote:
Election 3 (owner: node-1) lost: vote from node-1 (Uptime)
Feb  6 01:39:59 node-2 crmd[27475]:   notice: do_state_transition: State
transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_election_count_vote ]
Feb  6 01:39:59 node-2 crmd[27475]:     info: do_dc_release: DC role
released
Feb  6 01:39:59 node-2 crmd[27475]:     info: do_te_control: Transitioner
is now inactive
Feb  6 01:40:19 node-2 crmd[27475]:     info: crm_timer_popped: Election
Trigger (I_DC_TIMEOUT) just popped (20000ms)
Feb  6 01:40:19 node-2 crmd[27475]:  warning: do_log: FSA: Input
I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb  6 01:40:19 node-2 crmd[27475]:   notice: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb  6 01:40:19 node-2 crmd[27475]:     info: do_election_count_vote:
Election 4 (owner: node-1) lost: vote from node-1 (Uptime)
Feb  6 01:40:19 node-2 crmd[27475]:   notice: do_state_transition: State
transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_election_count_vote ]
Feb  6 01:40:19 node-2 crmd[27475]:     info: do_dc_release: DC role
released
Feb  6 01:40:19 node-2 crmd[27475]:     info: do_te_control: Transitioner
is now inactive
Feb  6 01:40:39 node-2 crmd[27475]:     info: crm_timer_popped: Election
Trigger (I_DC_TIMEOUT) just popped (20000ms)
Feb  6 01:40:39 node-2 crmd[27475]:  warning: do_log: FSA: Input
I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb  6 01:40:39 node-2 crmd[27475]:   notice: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb  6 01:40:39 node-2 crmd[27475]:     info: do_election_count_vote:
Election 5 (owner: node-1) lost: vote from node-1 (Uptime)
Feb  6 01:40:39 node-2 crmd[27475]:   notice: do_state_transition: State
transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_election_count_vote ]
Feb  6 01:40:39 node-2 crmd[27475]:     info: do_dc_release: DC role
released
Feb  6 01:40:39 node-2 crmd[27475]:     info: do_te_control: Transitioner
is now inactive
Feb  6 01:40:59 node-2 crmd[27475]:     info: crm_timer_popped: Election
Trigger (I_DC_TIMEOUT) just popped (20000ms)
Feb  6 01:40:59 node-2 crmd[27475]:  warning: do_log: FSA: Input
I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
Feb  6 01:40:59 node-2 crmd[27475]:   notice: do_state_transition: State
transition S_PENDING -> S_ELECTION [ input=I_DC_TIMEOUT
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Feb  6 01:40:59 node-2 crmd[27475]:     info: do_election_count_vote:
Election 6 (owner: node-1) lost: vote from node-1 (Uptime)
Feb  6 01:40:59 node-2 crmd[27475]:   notice: do_state_transition: State
transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_election_count_vote ]
Feb  6 01:40:59 node-2 crmd[27475]:     info: do_dc_release: DC role
released
Feb  6 01:40:59 node-2 crmd[27475]:     info: do_te_control: Transitioner
is now inactive