[ClusterLabs] Timing issue with CRMD on one node not joining cluster/remains in S_PENDING

Shyam shyam.kaushik at gmail.com
Mon Nov 23 01:44:17 EST 2015


Hi All,

Need help w.r.t. a timing issue that I hit when using pacemaker-1.1.10
with heartbeat-3.0.5 on Ubuntu Trusty.

As seen in the logs below, the issue appears to be w.r.t. the membership
that CRMD looks at: it wrongly moves to the INTEGRATION phase while the
peer CRMD hasn't fully joined the cluster yet.

We have a 2-node cluster & occasionally, when one of the nodes comes up
after a reboot, it remains stuck in the S_PENDING state until
heartbeat/pacemaker on that node is restarted. This happens intermittently
& not always.

The logs below show the problem clearly.

node1 has been running & sees node2 coming up:

Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_ha_status_callback: Status
update: Node node2 now has status [init]
Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_ha_status_callback: Status
update: Node node2 now has status [up]
Nov 20 09:09:10 node1 heartbeat: [2943]: info: Link node2:eth0 up.
Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
node2: status init
Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
node2: status up
Nov 20 09:09:10 node1 heartbeat: [2943]: debug: get_delnodelist:
delnodelist=
Nov 20 09:09:10 node1 heartbeat: [2943]: info: all clients are now paused
Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_ha_status_callback: Status
update: Node node2 now has status [active]
Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [join] (DC=true)
Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
node2: status active
Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [leave] (DC=true)
Nov 20 09:09:11 node1 heartbeat: [2943]: info: all clients are now resumed


As can be seen above, the CRMD on the peer registered with heartbeat &
then de-registered (not sure why):
Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [join] (DC=true)
Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [leave] (DC=true)
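To make the window explicit, here is a toy model of the race as I read it
from the logs (this is purely illustrative, not pacemaker source: the event
ordering is taken from the log lines above, but the data structures and names
are my own). The DC reacts to the CCM *membership* event while the peer's
crmd *client* status is still [leave], so it starts integrating a peer whose
crmd cannot yet answer:

```python
# Toy model of the race (illustrative only, not pacemaker code).
# Event order is copied from the logs above: crmd joins, leaves,
# then CCM declares node2 a member, then crmd joins again.
events = [
    ("client", "node2/crmd", "join"),    # crmd registered with heartbeat
    ("client", "node2/crmd", "leave"),   # ...and de-registered again
    ("membership", "node2", "member"),   # CCM declares node2 a member
    ("client", "node2/crmd", "join"),    # crmd rejoins, but too late
]

crmd_clients = set()
integration_started_with_peer_absent = False

for kind, who, status in events:
    if kind == "client":
        if status == "join":
            crmd_clients.add(who)
        else:
            crmd_clients.discard(who)
    elif kind == "membership" and status == "member":
        # The DC moves S_IDLE -> S_INTEGRATION on the membership event
        # alone; if the peer crmd is not registered at this instant,
        # the join offers it sends go unanswered.
        integration_started_with_peer_absent = "node2/crmd" not in crmd_clients

print(integration_started_with_peer_absent)  # True: the window the logs show
```

With this ordering the membership event lands exactly inside the
join/leave gap, which matches what we see on node1.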


While this is the state, CRMD on node1 moves to the INTEGRATION phase &
there are join requests flowing through, but I presume they cannot be
handled by the peer:

Nov 20 09:09:13 node1 cib[3901]:   notice: crm_update_peer_state:
crm_update_ccm_node: Node node2[0] - state is now member (was (null))
Nov 20 09:09:13 node1 crmd[3905]:   notice: crm_update_peer_state:
crm_update_ccm_node: Node node2[0] - state is now member (was lost)
Nov 20 09:09:13 node1 crmd[3905]:   notice: do_state_transition: State
transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
origin=peer_update_callback ]
Nov 20 09:09:13 node1 ccm: [3900]: debug: quorum plugin: majority
Nov 20 09:09:13 node1 ccm: [3900]: debug: cluster:linux-ha, member_count=2,
member_quorum_votes=200
Nov 20 09:09:13 node1 ccm: [3900]: debug: total_node_count=2,
total_quorum_votes=200
Nov 20 09:09:13 node1 crmd[3905]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [join] (DC=true)
Nov 20 09:09:13 node1 crmd[3905]:  warning: match_down_event: No match for
shutdown action on 00000432-0432-0000-2b91-000000000000
Nov 20 09:09:13 node1 crmd[3905]:   notice: peer_update_callback:
Stonith/shutdown of node2 not matched
Nov 20 09:09:13 node1 attrd[3904]:   notice: attrd_local_callback: Sending
full refresh (origin=crmd)
Nov 20 09:09:13 node1 attrd[3904]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: probe_complete (true)
Nov 20 09:09:13 node1 attrd[3904]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: master-MYSQL:0 (100)
Nov 20 09:09:13 node1 pengine[3906]:   notice: unpack_config: On loss of
CCM Quorum: Ignore
Nov 20 09:09:13 node1 pengine[3906]:  warning: custom_action: Action
IPADDR:0_monitor_0 on node2 is unrunnable (pending)


As can be seen, we had:
Nov 20 09:09:13 node1 crmd[3905]:   notice: do_state_transition: State
transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL

& only after this does the CRMD on the peer appear to come up:
Nov 20 09:09:13 node1 crmd[3905]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [join] (DC=true)

So the pengine decided it could not run any resources & we are stuck in
this state. Meanwhile, node2 has this trace:
Nov 20 09:09:13 node2 cib[2964]:   notice: cib_server_process_diff: Not
applying diff 1.3.33 -> 1.3.34 (sync in progress)
Nov 20 09:09:13 node2 cib[2964]:   notice: crm_update_peer_state:
crm_update_ccm_node: Node node1[1] - state is now member (was (null))
Nov 20 09:09:13 node2 cib[2964]:   notice: crm_update_peer_state:
crm_update_ccm_node: Node node2[0] - state is now member (was (null))
Nov 20 09:09:13 node2 crmd[2968]:   notice: crm_cluster_connect: Connecting
to cluster infrastructure: heartbeat
Nov 20 09:09:14 node2 heartbeat: [2938]: info: the send queue length from
heartbeat to client crmd is set to 1024
Nov 20 09:09:15 node2 crmd[2968]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [join] (DC=false)
Nov 20 09:09:15 node2 crmd[2968]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [online] (DC=false)
Nov 20 09:09:15 node2 crmd[2968]:   notice: crmd_client_status_callback:
Status update: Client node1/crmd now has status [online] (DC=false)
Nov 20 09:09:16 node2 crmd[2968]:   notice: crm_update_peer_state:
crm_update_ccm_node: Node node1[1] - state is now member (was (null))
Nov 20 09:09:16 node2 crmd[2968]:   notice: crm_update_peer_state:
crm_update_ccm_node: Node node2[0] - state is now member (was (null))
Nov 20 09:09:16 node2 crmd[2968]:   notice: do_started: The local CRM is
operational
Nov 20 09:09:16 node2 crmd[2968]:   notice: do_state_transition: State
transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_started ]
Nov 20 09:09:27 node2 crmd[2968]:  warning: do_log: FSA: Input I_DC_TIMEOUT
from crm_timer_popped() received in state S_PENDING


i.e. it moves to the S_PENDING state & keeps receiving I_DC_TIMEOUT.
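My understanding of why it stays stuck, sketched as a minimal state machine
(again illustrative, not pacemaker source: the state and input names mirror
the log messages, but the transition table is my assumption). I_DC_TIMEOUT
alone never advances S_PENDING; only a join offer from a DC would:

```python
# Minimal sketch of the stuck loop (assumed transitions, not pacemaker code).
class Fsa:
    def __init__(self):
        self.state = "S_PENDING"

    def input(self, event):
        if self.state == "S_PENDING":
            if event == "I_DC_TIMEOUT":
                # No DC join offer arrived before the timer popped; the
                # node just re-arms the timer and stays put.
                return self.state
            if event == "I_JOIN_OFFER":
                # A join offer from the DC is what would move us forward.
                self.state = "S_NOT_DC"
        return self.state

fsa = Fsa()
# Repeated timer pops never change the state...
for _ in range(3):
    assert fsa.input("I_DC_TIMEOUT") == "S_PENDING"
# ...only a join offer from the DC does.
assert fsa.input("I_JOIN_OFFER") == "S_NOT_DC"
```

Since the DC's join offers were sent while node2's crmd was absent, node2
never sees one, and the timeout loop repeats indefinitely.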

I tried using the latest heartbeat, but that does not seem to be the
problem. Can anyone say whether this issue has already been fixed in the
latest pacemaker, or suggest how to debug it further?

If I enable a higher debug level (in both heartbeat & pacemaker), the
problem does not show up. Any help/pointers on how to go forward are
greatly appreciated.

Thanks!

--Shyam

