[ClusterLabs] Timing issue with CRMD on one node not joining cluster/remains in S_PENDING

Shyam shyam.kaushik at gmail.com
Mon Nov 23 11:00:43 EST 2015


One note on this.

This problem doesn't happen if the crmd status sequence looks like this:

Nov 19 08:36:30 node1 crmd[3298]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [join] (DC=true)
Nov 19 08:36:30 node1 crmd[3298]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [leave] (DC=true)
Nov 19 08:36:31 node1 crmd[3298]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [join] (DC=true)

i.e. the final [join] happens before the transition
Nov 19 08:36:34 node1 crmd[3298]:   notice: do_state_transition: State
transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
origin=peer_update_callback ]

I am not sure why the crmd join/leave and then join sequence happens. With the
previous version of heartbeat/pacemaker this doesn't happen.
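
For reference, the ordering on the DC can be compared by pulling the relevant
lines out of syslog, e.g. (a rough sketch; the exact log file depends on the
ha.cf logging setup):

  grep -E 'crmd_client_status_callback|do_state_transition' /var/log/syslog

In the good case the last [join] callback for node2 is logged before the
S_IDLE -> S_INTEGRATION transition, as shown above.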

Here are more logs from a case where the problem doesn't happen:

Nov 19 08:36:29 node1 heartbeat: [3143]: info: Link node2:eth0 up.
Nov 19 08:36:29 node1 heartbeat: [3143]: info: Status update for node
node2: status init
Nov 19 08:36:29 node1 crmd[3298]:   notice: crmd_ha_status_callback: Status
update: Node node2 now has status [init]
Nov 19 08:36:29 node1 crmd[3298]:   notice: crmd_ha_status_callback: Status
update: Node node2 now has status [up]
Nov 19 08:36:30 node1 heartbeat: [3143]: debug: get_delnodelist:
delnodelist=
Nov 19 08:36:30 node1 heartbeat: [3143]: info: Status update for node
node2: status active
Nov 19 08:36:30 node1 crmd[3298]:   notice: crmd_ha_status_callback: Status
update: Node node2 now has status [active]
Nov 19 08:36:30 node1 crmd[3298]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [join] (DC=true)
Nov 19 08:36:30 node1 crmd[3298]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [leave] (DC=true)
Nov 19 08:36:31 node1 crmd[3298]:   notice: crmd_client_status_callback:
Status update: Client node2/crmd now has status [join] (DC=true)
Nov 19 08:36:34 node1 ccm: [3293]: debug: quorum plugin: majority
Nov 19 08:36:34 node1 ccm: [3293]: debug: cluster:linux-ha, member_count=2,
member_quorum_votes=200
Nov 19 08:36:34 node1 ccm: [3293]: debug: total_node_count=2,
total_quorum_votes=200
Nov 19 08:36:34 node1 cib[3294]:   notice: crm_update_peer_state:
crm_update_ccm_node: Node node2[1] - state is now member (was (null))
Nov 19 08:36:34 node1 crmd[3298]:   notice: crm_update_peer_state:
crm_update_ccm_node: Node node2[1] - state is now member (was lost)
Nov 19 08:36:34 node1 crmd[3298]:   notice: do_state_transition: State
transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
origin=peer_update_callback ]

After the above, resources start fine:
Nov 19 08:36:40 node1 attrd[3297]:   notice: attrd_local_callback: Sending
full refresh (origin=crmd)
Nov 19 08:36:40 node1 attrd[3297]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: probe_complete (true)
Nov 19 08:36:40 node1 attrd[3297]:   notice: attrd_trigger_update: Sending
flush op to all hosts for: master-MYSQL:0 (100)
Nov 19 08:36:41 node1 pengine[3299]:   notice: unpack_config: On loss of
CCM Quorum: Ignore
Nov 19 08:36:41 node1 pengine[3299]:   notice: LogActions: Start
IPADDR:1#011(node2)

Any help/pointers greatly appreciated.

Thanks.

--Shyam

On Mon, Nov 23, 2015 at 12:14 PM, Shyam <shyam.kaushik at gmail.com> wrote:

> Hi All,
>
> Need help with a timing issue that I hit when using pacemaker-1.1.10
> with heartbeat-3.0.5 on Ubuntu Trusty.
>
> As seen in the logs below, the issue appears to be with the membership
> that the CRMD looks at: it wrongly moves to the INTEGRATION phase when the
> peer CRMD hasn't fully joined the cluster yet.
>
> We have a 2-node cluster, and occasionally when one of the nodes comes up
> after a reboot it remains in the S_PENDING state until heartbeat/pacemaker
> on that node is restarted. This happens periodically, not always.
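>
> The only workaround so far is to restart the stack on the stuck node; a
> rough sketch, assuming the Trusty heartbeat init script also manages the
> pacemaker daemons it respawns:
>
>   # restart heartbeat on the stuck node; crmd and friends come back with it
>   service heartbeat restart
>
> after which the node joins normally.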
>
> The logs below show the problem clearly.
>
> node1 has been running & sees node2 coming up:
>
> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_ha_status_callback:
> Status update: Node node2 now has status [init]
> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_ha_status_callback:
> Status update: Node node2 now has status [up]
> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Link node2:eth0 up.
> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
> node2: status init
> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
> node2: status up
> Nov 20 09:09:10 node1 heartbeat: [2943]: debug: get_delnodelist:
> delnodelist=
> Nov 20 09:09:10 node1 heartbeat: [2943]: info: all clients are now paused
> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_ha_status_callback:
> Status update: Node node2 now has status [active]
> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [join] (DC=true)
> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
> node2: status active
> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [leave] (DC=true)
> Nov 20 09:09:11 node1 heartbeat: [2943]: info: all clients are now resumed
>
>
> As can be seen above, the CRMD on the peer registered with heartbeat &
> then de-registered (not sure why):
> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [join] (DC=true)
> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [leave] (DC=true)
>
>
> While this is the state, the CRMD on node1 moves to the INTEGRATION phase
> & join requests are flowing through, but presumably they cannot be handled
> by the peer yet:
>
> Nov 20 09:09:13 node1 cib[3901]:   notice: crm_update_peer_state:
> crm_update_ccm_node: Node node2[0] - state is now member (was (null))
> Nov 20 09:09:13 node1 crmd[3905]:   notice: crm_update_peer_state:
> crm_update_ccm_node: Node node2[0] - state is now member (was lost)
> Nov 20 09:09:13 node1 crmd[3905]:   notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=peer_update_callback ]
> Nov 20 09:09:13 node1 ccm: [3900]: debug: quorum plugin: majority
> Nov 20 09:09:13 node1 ccm: [3900]: debug: cluster:linux-ha,
> member_count=2, member_quorum_votes=200
> Nov 20 09:09:13 node1 ccm: [3900]: debug: total_node_count=2,
> total_quorum_votes=200
> Nov 20 09:09:13 node1 crmd[3905]:   notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [join] (DC=true)
> Nov 20 09:09:13 node1 crmd[3905]:  warning: match_down_event: No match for
> shutdown action on 00000432-0432-0000-2b91-000000000000
> Nov 20 09:09:13 node1 crmd[3905]:   notice: peer_update_callback:
> Stonith/shutdown of node2 not matched
> Nov 20 09:09:13 node1 attrd[3904]:   notice: attrd_local_callback: Sending
> full refresh (origin=crmd)
> Nov 20 09:09:13 node1 attrd[3904]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: probe_complete (true)
> Nov 20 09:09:13 node1 attrd[3904]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: master-MYSQL:0 (100)
> Nov 20 09:09:13 node1 pengine[3906]:   notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Nov 20 09:09:13 node1 pengine[3906]:  warning: custom_action: Action
> IPADDR:0_monitor_0 on node2 is unrunnable (pending)
>
>
> As can be seen, we had
> Nov 20 09:09:13 node1 crmd[3905]:   notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
>
> & only then does the CRMD on the peer seem to be fine:
> Nov 20 09:09:13 node1 crmd[3905]:   notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [join] (DC=true)
>
> So pengine decided it could not run any resources & we are stuck in this
> state. Meanwhile, node2 has this trace:
> Nov 20 09:09:13 node2 cib[2964]:   notice: cib_server_process_diff: Not
> applying diff 1.3.33 -> 1.3.34 (sync in progress)
> Nov 20 09:09:13 node2 cib[2964]:   notice: crm_update_peer_state:
> crm_update_ccm_node: Node node1[1] - state is now member (was (null))
> Nov 20 09:09:13 node2 cib[2964]:   notice: crm_update_peer_state:
> crm_update_ccm_node: Node node2[0] - state is now member (was (null))
> Nov 20 09:09:13 node2 crmd[2968]:   notice: crm_cluster_connect:
> Connecting to cluster infrastructure: heartbeat
> Nov 20 09:09:14 node2 heartbeat: [2938]: info: the send queue length from
> heartbeat to client crmd is set to 1024
> Nov 20 09:09:15 node2 crmd[2968]:   notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [join] (DC=false)
> Nov 20 09:09:15 node2 crmd[2968]:   notice: crmd_client_status_callback:
> Status update: Client node2/crmd now has status [online] (DC=false)
> Nov 20 09:09:15 node2 crmd[2968]:   notice: crmd_client_status_callback:
> Status update: Client node1/crmd now has status [online] (DC=false)
> Nov 20 09:09:16 node2 crmd[2968]:   notice: crm_update_peer_state:
> crm_update_ccm_node: Node node1[1] - state is now member (was (null))
> Nov 20 09:09:16 node2 crmd[2968]:   notice: crm_update_peer_state:
> crm_update_ccm_node: Node node2[0] - state is now member (was (null))
> Nov 20 09:09:16 node2 crmd[2968]:   notice: do_started: The local CRM is
> operational
> Nov 20 09:09:16 node2 crmd[2968]:   notice: do_state_transition: State
> transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
> origin=do_started ]
> Nov 20 09:09:27 node2 crmd[2968]:  warning: do_log: FSA: Input
> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>
>
> i.e. it moves to the S_PENDING state & keeps receiving I_DC_TIMEOUT.
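>
> A quick way to confirm the stuck state (a sketch, assuming the pacemaker
> CLI tools are available on the nodes):
>
>   # ask which node the cluster currently considers the DC
>   crmadmin -D
>   # query the crmd FSA state of the named node (S_PENDING here)
>   crmadmin -S node2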
>
> I tried using the latest heartbeat, but that doesn't seem to be the
> problem. Can anyone say whether this issue has already been fixed in the
> latest pacemaker, or suggest other ways to debug it?
>
> If I enable a higher debug level (in both heartbeat and pacemaker), this
> problem doesn't show up. Any help/pointers on how to go forward are greatly
> appreciated.
>
> Thanks!
>
> --Shyam
>