[ClusterLabs] BUG in crmd/ccm membership handling when crmd registers with cluster a bit late

Shyam shyam.kaushik at gmail.com
Sat Nov 28 20:25:20 UTC 2015


Hi Lars,

Thanks for your reply. We used pacemaker-1.1.10/heartbeat-3.0.5 because,
unfortunately, that's what Ubuntu Trusty LTS ships with. In fact, Ubuntu
Xenial still ships heartbeat-3.0.5.

We will move to heartbeat-3.0.6/pacemaker-1.1.13 as you suggest. One
question for you:

> Also, if you want to do Pacemaker with "older" heartbeat,
> you have to go Pacemaker 1.1.6 or older,
> or what you observe here is just one (small) issue.
> 1.1.7 up until 1.1.13
> (or, 1.1.12 + linbit patches, which have then been merged into 1.1.13)
> do not work properly with heartbeat, due to changes in pacemaker
> upstream which have been incompatible with the heartbeat messaging and
> membership layer (especially when it comes to fencing/stonith).

Are you saying that pacemaker-1.1.13 will work correctly with
heartbeat-3.0.6, or are there known issues that mean we must go back to
pacemaker-1.1.6?

At this point it would be hard for us to shift to corosync, so it is
better for us to continue with heartbeat. Your input is greatly
appreciated. Thank you!

--Shyam


> Date: Fri, 27 Nov 2015 12:58:15 +0100
> From: Lars Ellenberg <lars.ellenberg at linbit.com>
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] BUG in crmd/ccm membership handling when
>         crmd registers with cluster a bit late
> Message-ID: <20151127115813.GF22693 at soda.linbit>
> Content-Type: text/plain; charset=iso-8859-1
>
> On Fri, Nov 27, 2015 at 10:44:02AM +0530, Shyam wrote:
>> Hi Folks,
>>
>> I am using pacemaker-1.1.10 with heartbeat-3.0.5 on Ubuntu Trusty & hit
>> upon a bug in CRMD/CCM. This bug exists in mainline pacemaker as well.
>
> You know that there is heartbeat 3.0.6 and Pacemaker 1.1.13 ?
>
> heartbeat older than 3.0.6 is broken in a number of ways.
>
> Also, if you want to do Pacemaker with "older" heartbeat,
> you have to go Pacemaker 1.1.6 or older,
> or what you observe here is just one (small) issue.
> 1.1.7 up until 1.1.13
> (or, 1.1.12 + linbit patches, which have then been merged into 1.1.13)
> do not work properly with heartbeat, due to changes in pacemaker
> upstream which have been incompatible with the heartbeat messaging and
> membership layer (especially when it comes to fencing/stonith).
>
> --
> : Lars Ellenberg
> : http://www.LINBIT.com | Your Way to High Availability
> : DRBD, Linux-HA  and  Pacemaker support and consulting
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

On Fri, Nov 27, 2015 at 10:44 AM, Shyam <shyam.kaushik at gmail.com> wrote:

> Hi Folks,
>
> I am using pacemaker-1.1.10 with heartbeat-3.0.5 on Ubuntu Trusty & hit
> upon a bug in CRMD/CCM. This bug exists in mainline pacemaker as well.
>
> The bug can easily be exposed in a 2-node cluster by adding a sleep(5)
> within do_ha_control(), just before it calls crm_cluster_connect(cluster),
> & rebooting one of the nodes. That node will remain in pending state &
> never join the cluster.
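>
> As a sketch, the injected delay looks like this (context paraphrased
> from do_ha_control() in crmd/control.c; the sleep(5) line is the only
> change):
>
>     /* crmd/control.c, do_ha_control() -- reproduction aid only */
>     if (action & A_HA_CONNECT) {
>         sleep(5);   /* injected: let cib/ccm win the race to heartbeat */
>         registered = crm_cluster_connect(cluster);
>     }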
>
> The bug is this:
> # cib/ccm connect to the heartbeat cluster earlier, & crmd also connects
> to cib/ccm on this node. Because of this the peer gets membership events &
> considers crmd on the incoming node online, even though that crmd is not
> yet connected to the heartbeat cluster. So when the node acting as DC
> sends join-offers, they are lost & the election doesn't progress. Later,
> when the incoming node actually connects to the heartbeat cluster, it
> generates node events that CRMD wrongly treats as offline events, & we
> remove the membership entry for the incoming node. After this point all
> messages from the peer are completely ignored.
>
> Can you please suggest the best way to fix this problem? Would making
> crmd connect to heartbeat first & only then to cib/ccm help?
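>
> Conceptually, the re-ordering asked about above would look like this
> (pseudocode only -- in the real crmd these steps are separate FSA
> actions, so the actual change would be to the ordering of those actions,
> & connect_to_cib()/connect_to_ccm() are hypothetical stand-ins for them):
>
>     /* hypothetical crmd start-up order: messaging layer before cib/ccm,
>      * so a peer never sees cib online while crmd is still unreachable */
>     crm_cluster_connect(cluster);   /* 1. sign into heartbeat first */
>     connect_to_cib();               /* 2. then the CIB connection   */
>     connect_to_ccm();               /* 3. then CCM membership       */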
>
>
> Here are detailed logs of the problem:
>
> node1 is coming up & starts heartbeat
> Nov 26 19:07:08 node1 heartbeat: [2937]: info: Configuration validated.
> Starting heartbeat 3.0.5
>
> as part of this, the running node2/DC gets a link-up event & also a crm
> event that the peer's heartbeat is online
>
> Nov 26 19:07:08 node2 heartbeat: [3393]: info: Link node1:eth0 up.
> Nov 26 19:07:08 node2 heartbeat: [3393]: info: Status update for node
> node1: status init
>
> Nov 26 19:07:08 node2 crmd[3523]:   notice: crmd_ha_status_callback:
> Status update: Node node1 now has status [init]
> Nov 26 19:07:08 node2 crmd[3523]:     info: crm_update_peer_proc:
> crmd_ha_status_callback: Node node1[0] - heartbeat is now online
> Nov 26 19:07:08 node2 crmd[3523]:     crit: peer_update_callback: Client
> node1/peer now has status [offline] (DC=true)
> Nov 26 19:07:08 node2 crmd[3523]:     crit: peer_update_callback: No
> change      1 1000001 4000200
> Nov 26 19:07:08 node2 crmd[3523]:   notice: crmd_ha_status_callback:
> Status update: Node node1 now has status [up]
>
>
>
> meanwhile crmd/cib on node1 start up & crmd connects to cib
> Nov 26 19:07:09 node1 crmd[2969]:   notice: main: CRM Git Version: 42f2063
> Nov 26 19:07:09 node1 crmd[2969]:    debug: crmd_init: Starting crmd
>
> Nov 26 19:07:09 node1 cib[2965]:     info: crm_get_peer: Node 0 has uuid
> 00000432-0432-0000-2b91-000000000000
> Nov 26 19:07:09 node1 cib[2965]:     info: register_heartbeat_conn:
> Hostname: node1
> Nov 26 19:07:09 node1 cib[2965]:     info: register_heartbeat_conn: UUID:
> 00000432-0432-0000-2b91-000000000000
> Nov 26 19:07:09 node1 cib[2965]:     info: ccm_connect: Registering with
> CCM...
> Nov 26 19:07:09 node1 cib[2965]:    debug: ccm_connect: CCM Activation
> passed... all set to go!
> Nov 26 19:07:09 node1 cib[2965]:     info: cib_init: Requesting the list
> of configured nodes
>
> Nov 26 19:07:10 node1 crmd[2969]:    debug: cib_native_signon_raw:
> Connection to CIB successful
>
> Nov 26 19:07:10 node1 cib[2965]:     info: cib_process_request: Completed
> cib_query operation for section 'all': OK (rc=0, origin=local/crmd/2,
> version=1.3.0)
> Nov 26 19:07:10 node1 cib[2965]:     info: crm_client_new: Connecting
> 0x7fcbfab10480 for uid=0 gid=0 pid=2967
> id=276f6f8e-14cc-426e-bf63-b79674f1bfaa
> Nov 26 19:07:10 node1 cib[2965]:    debug: handle_new_connection: IPC
> credentials authenticated (2965-2967-12)
> Nov 26 19:07:10 node1 cib[2965]:    debug: qb_ipcs_shm_connect: connecting
> to client [2967]
> Nov 26 19:07:10 node1 cib[2965]:    debug: qb_rb_open_2: shm size:524301;
> real_size:528384; rb->word_size:132096
>
>
> cib on DC gets notified/marked online
> Nov 26 19:07:09 node2 cib[3519]:     info: cib_client_status_callback:
> Status update: Client node1/cib now has status [join]
> Nov 26 19:07:09 node2 crmd[3523]:   notice: crmd_ha_status_callback:
> Status update: Node node1 now has status [active]
> Nov 26 19:07:09 node2 crmd[3523]:   notice: crmd_client_status_callback:
> Status update: Client node1/crmd now has status [join] (DC=true)
> Nov 26 19:07:09 node2 cib[3519]:     info: crm_get_peer: Created entry
> 5886699f-30f8-4b7d-9d38-4e661d99b2d9/0x7fa985e33b70 for node node1/0 (2
> total)
> Nov 26 19:07:09 node2 crmd[3523]:   notice: crmd_client_status_callback:
> Status update: Client node1/crmd now has status [leave] (DC=true)
> Nov 26 19:07:10 node2 cib[3519]:     info: crm_get_peer: Node 0 has uuid
> 00000432-0432-0000-2b91-000000000000
> Nov 26 19:07:10 node2 cib[3519]:     info: crm_update_peer_proc:
> cib_client_status_callback: Node node1[0] - cib is now online
> Nov 26 19:07:10 node2 cib[3519]:     info: cib_client_status_callback:
> Status update: Client node1/cib now has status [leave]
> Nov 26 19:07:10 node2 cib[3519]:     info: crm_update_peer_proc:
> cib_client_status_callback: Node node1[0] - cib is now offline
> Nov 26 19:07:10 node2 cib[3519]:     info: cib_client_status_callback:
> Status update: Client node1/cib now has status [join]
> Nov 26 19:07:10 node2 cib[3519]:     info: crm_update_peer_proc:
> cib_client_status_callback: Node node1[0] - cib is now online
> Nov 26 19:07:10 node2 cib[3519]:     info: cib_process_request: Completed
> cib_modify operation for section status: OK (rc=0, origin=local/crmd/58,
> version=1.3.31)
> Nov 26 19:07:10 node2 cib[3519]:     info: cib_process_request: Completed
> cib_modify operation for section status: OK (rc=0, origin=local/crmd/59,
> version=1.3.31)
> Nov 26 19:07:10 node2 cib[3519]:     info: cib_process_request: Completed
> cib_modify operation for section status: OK (rc=0, origin=local/crmd/60,
> version=1.3.31)
>
>
> as part of CCM processing, the DC considers the peer node for membership &
> marks its crmd online
> Nov 26 19:07:12 node2 ccm: [3518]: debug: quorum plugin: majority
> Nov 26 19:07:12 node2 ccm: [3518]: debug: cluster:linux-ha,
> member_count=2, member_quorum_votes=200
> Nov 26 19:07:12 node2 ccm: [3518]: debug: total_node_count=2,
> total_quorum_votes=200
> Nov 26 19:07:12 node2 crmd[3523]:     info: crmd_ccm_msg_callback: Quorum
> (re)attained after event=NEW MEMBERSHIP (id=2)
> Nov 26 19:07:12 node2 cib[3519]:     info: cib_ccm_msg_callback:
> Processing CCM event=NEW MEMBERSHIP (id=2)
> Nov 26 19:07:12 node2 crmd[3523]:     info: ccm_event_detail: NEW
> MEMBERSHIP: trans=2, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4
> Nov 26 19:07:12 node2 crmd[3523]:     info: ccm_event_detail: #011CURRENT:
> node2 [nodeid=1, born=1]
> Nov 26 19:07:12 node2 cib[3519]:   notice: crm_update_peer_state:
> crm_update_ccm_node: Node node1[0] - state is now member (was (null))
> Nov 26 19:07:12 node2 crmd[3523]:     info: ccm_event_detail: #011CURRENT:
> node1 [nodeid=0, born=2]
> Nov 26 19:07:12 node2 crmd[3523]:     info: ccm_event_detail: #011NEW:
> node1 [nodeid=0, born=2]
> Nov 26 19:07:12 node2 crmd[3523]:   notice: crm_update_peer_state:
> crm_update_ccm_node: Node node1[0] - state is now member (was lost)
> Nov 26 19:07:12 node2 cib[3519]:     info: crm_update_peer_proc:
> crm_update_ccm_node: Node node1[0] - heartbeat is now online
> Nov 26 19:07:12 node2 cib[3519]:     info: crm_update_peer_proc:
> crm_update_ccm_node: Node node1[0] - crmd is now online
> Nov 26 19:07:12 node2 crmd[3523]:     crit: peer_update_callback: node1 is
> now member (was lost)
> Nov 26 19:07:12 node2 crmd[3523]:     crit: peer_update_callback: Alive=0,
> appear=1, down=(nil)
> Nov 26 19:07:12 node2 crmd[3523]:     crit: peer_update_callback: Other
> (nil)
> Nov 26 19:07:12 node2 crmd[3523]:     info: crm_update_peer_proc:
> crm_update_ccm_node: Node node1[0] - crmd is now online
> Nov 26 19:07:12 node2 cib[3519]:     info: cib_process_request: Completed
> cib_modify operation for section status: OK (rc=0, origin=local/crmd/61,
> version=1.3.31)
> Nov 26 19:07:12 node2 crmd[3523]:     crit: peer_update_callback: Client
> node1/peer now has status [online] (DC=true)
> Nov 26 19:07:12 node2 crmd[3523]:     crit: peer_update_callback: Alive=1,
> appear=1, down=(nil)
> Nov 26 19:07:12 node2 crmd[3523]:     crit: peer_update_callback: Other
> (nil)
>
>
> subsequently this triggers the join processing, but since the peer node's
> crmd has not yet connected to the heartbeat cluster it cannot receive the
> join-offers being sent by the DC
> Nov 26 19:07:12 node2 crmd[3523]:    debug: post_cache_update: Updated
> cache after membership event 2.
> Nov 26 19:07:12 node2 crmd[3523]:    debug: post_cache_update:
> post_cache_update added action A_ELECTION_CHECK to the FSA
> Nov 26 19:07:12 node2 crmd[3523]:     crit: s_crmd_fsa: Processing
> I_NODE_JOIN: [ state=S_IDLE cause=C_FSA_INTERNAL
> origin=peer_update_callback ]
> Nov 26 19:07:12 node2 crmd[3523]:   notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=peer_update_callback ]
> Nov 26 19:07:12 node2 crmd[3523]:    debug: crm_timer_start: Started
> Integration Timer (I_INTEGRATED:180000ms), src=95
> Nov 26 19:07:12 node2 crmd[3523]:    debug: do_election_check: Ignore
> election check: we not in an election
> Nov 26 19:07:12 node2 crmd[3523]:     crit: do_dc_join_offer_one: An
> unknown node joined - (re-)offer to any unconfirmed nodes
> Nov 26 19:07:12 node2 crmd[3523]:     crit: join_make_offer: about to make
> an offer to node2
> Nov 26 19:07:12 node2 crmd[3523]:     crit: join_make_offer: Making join
> offers based on membership 2
> Nov 26 19:07:12 node2 crmd[3523]:     crit: join_make_offer: Skipping
> node2: already known 4
> Nov 26 19:07:12 node2 crmd[3523]:     crit: join_make_offer: about to make
> an offer to node1
> Nov 26 19:07:12 node2 crmd[3523]:     crit: join_make_offer: join-1:
> Sending offer to node1
> Nov 26 19:07:12 node2 crmd[3523]:     info: crm_update_peer_join:
> join_make_offer: Node node1[0] - join-1 phase 0 -> 1
> Nov 26 19:07:12 node2 crmd[3523]:    debug: check_join_state: Invoked by
> do_dc_join_offer_one in state: S_INTEGRATION
> Nov 26 19:07:12 node2 crmd[3523]:    debug: do_te_invoke: Halting the
> transition: inactive
> Nov 26 19:07:12 node2 crmd[3523]:     info: abort_transition_graph:
> do_te_invoke:158 - Triggered transition abort (complete=1) : Peer Halt
> Nov 26 19:07:12 node2 crmd[3523]:     crit: s_crmd_fsa: Processing
> I_PE_CALC: [ state=S_INTEGRATION cause=C_FSA_INTERNAL
> origin=abort_transition_graph ]
>
>
> later crmd on the node that is coming up connects to heartbeat
> Nov 26 19:07:15 node1 crmd[2969]:   notice: crm_cluster_connect:
> Connecting to cluster infrastructure: heartbeat
> Nov 26 19:07:15 node1 crmd[2969]:    debug: register_heartbeat_conn:
> Signing in with Heartbeat
> Nov 26 19:07:15 node1 crmd[2969]:     info: crm_get_peer: Created entry
> 75b26770-ac15-461c-889d-190e6c8139ac/0x7f2fd194f870 for node node1/0 (1
> total)
> Nov 26 19:07:15 node1 crmd[2969]:     crit: peer_update_callback: node1 is
> now (null)
> Nov 26 19:07:15 node1 crmd[2969]:     info: crm_get_peer: Node 0 has uuid
> 00000432-0432-0000-2b91-000000000000
> Nov 26 19:07:15 node1 crmd[2969]:     info: register_heartbeat_conn:
> Hostname: node1
> Nov 26 19:07:15 node1 crmd[2969]:     info: register_heartbeat_conn: UUID:
> 00000432-0432-0000-2b91-000000000000
>
>
> & at this point heartbeat delivers a notification to crmd, as part of
> which crmd wrongly marks the incoming node offline & removes it from
> membership
> Nov 26 19:07:15 node2 crmd[3523]:   notice: crmd_client_status_callback:
> Status update: Client node1/crmd now has status [join] (DC=true)
> Nov 26 19:07:15 node2 crmd[3523]:     info: crm_update_peer_proc:
> crmd_client_status_callback: Node node1[0] - crmd is now join
> Nov 26 19:07:15 node2 crmd[3523]:     crit: peer_update_callback: Client
> node1/peer now has status [offline] (DC=true)
>
> Nov 26 19:07:15 node2 crmd[3523]:  warning: match_down_event: No match for
> shutdown action on 00000432-0432-0000-2b91-000000000000
> Nov 26 19:07:15 node2 crmd[3523]:     crit: peer_update_callback: Alive=0,
> appear=0, down=(nil)
> Nov 26 19:07:15 node2 crmd[3523]:   notice: peer_update_callback:
> Stonith/shutdown of node1 not matched
> Nov 26 19:07:15 node2 crmd[3523]:     info: crm_update_peer_join:
> peer_update_callback: Node node1[0] - join-1 phase 1 -> 0
> Nov 26 19:07:15 node2 crmd[3523]:    debug: check_join_state: Invoked by
> peer_update_callback in state: S_INTEGRATION
> Nov 26 19:07:15 node2 crmd[3523]:    debug: check_join_state: join-1:
> Integration of 0 peers complete: peer_update_callback
> Nov 26 19:07:15 node2 crmd[3523]:     info: abort_transition_graph:
> peer_update_callback:214 - Triggered transition abort (complete=1) : Node
> failure
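>
> The shape of the bookkeeping behind that flip to [offline] is roughly
> as follows (a paraphrase in plain C; the flag value & the liveness rule
> are illustrative, not the verbatim 1.1.10 source):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>     #include <string.h>
>
>     #define PROC_CRMD 0x200   /* illustrative "peer's crmd seen" bit */
>
>     struct peer {
>         uint32_t processes;   /* client processes heartbeat reported */
>         const char *state;    /* "member"/"lost", driven by CCM      */
>     };
>
>     static bool peer_is_online(const struct peer *p)
>     {
>         return (p->processes & PROC_CRMD)
>             && p->state && strcmp(p->state, "member") == 0;
>     }
>
>     /* The race: crmd's late sign-on generates a fresh client-status
>      * event while the proc flags are being rewritten, peer_is_online()
>      * transiently reads false, & the DC concludes the node went
>      * offline, resetting the join-1 phase it had just advanced. */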
>
>
> So, bottom line: due to a timing issue around when crmd connects to the
> heartbeat cluster, the DC wrongly marks the incoming node as not a
> member. This causes every message from the incoming node to be ignored
> from this point onwards, e.g.:
>
> Nov 26 19:07:18 node2 crmd[3523]:    debug: crmd_ha_msg_callback: Ignoring
> HA message (op=noop) from node1: not in our membership list (size=1)
> Nov 26 19:07:19 node2 crmd[3523]:  warning: crmd_ha_msg_callback: Ignoring
> HA message (op=join_announce) from node1: not in our membership list
> (size=1)
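>
> For reference, the guard that drops these messages has roughly this
> shape (a paraphrase of the check in crmd's heartbeat message path;
> crm_get_peer(), crm_is_peer_active() & crm_active_peers() are the real
> libcrmcluster helpers, the surrounding lines are mine):
>
>     crm_node_t *peer = crm_get_peer(0, from);   /* from = sending node */
>
>     if (crm_is_peer_active(peer) == FALSE) {
>         /* once the membership entry is gone, every op is dropped here */
>         crm_warn("Ignoring HA message (op=%s) from %s: not in our"
>                  " membership list (size=%d)", op, from,
>                  crm_active_peers());
>         return;
>     }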
>
> --Shyam
>
>
> On Mon, Nov 23, 2015 at 9:30 PM, Shyam <shyam.kaushik at gmail.com> wrote:
>
>> One note on this.
>>
>> This problem doesn't happen if
>>
>> Nov 19 08:36:30 node1 crmd[3298]:   notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [join] (DC=true)
>> Nov 19 08:36:30 node1 crmd[3298]:   notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [leave] (DC=true)
>> Nov 19 08:36:31 node1 crmd[3298]:   notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [join] (DC=true)
>>
>> join happens before
>> Nov 19 08:36:34 node1 crmd[3298]:   notice: do_state_transition: State
>> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
>> origin=peer_update_callback ]
>>
>> I am not sure why the crmd join/leave & then join sequence happens.
>> With previous versions of heartbeat/pacemaker this didn't happen.
>>
>> here are more logs from a run where the problem doesn't happen:
>>
>> Nov 19 08:36:29 node1 heartbeat: [3143]: info: Link node2:eth0 up.
>> Nov 19 08:36:29 node1 heartbeat: [3143]: info: Status update for node
>> node2: status init
>> Nov 19 08:36:29 node1 crmd[3298]:   notice: crmd_ha_status_callback:
>> Status update: Node node2 now has status [init]
>> Nov 19 08:36:29 node1 crmd[3298]:   notice: crmd_ha_status_callback:
>> Status update: Node node2 now has status [up]
>> Nov 19 08:36:30 node1 heartbeat: [3143]: debug: get_delnodelist:
>> delnodelist=
>> Nov 19 08:36:30 node1 heartbeat: [3143]: info: Status update for node
>> node2: status active
>> Nov 19 08:36:30 node1 crmd[3298]:   notice: crmd_ha_status_callback:
>> Status update: Node node2 now has status [active]
>> Nov 19 08:36:30 node1 crmd[3298]:   notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [join] (DC=true)
>> Nov 19 08:36:30 node1 crmd[3298]:   notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [leave] (DC=true)
>> Nov 19 08:36:31 node1 crmd[3298]:   notice: crmd_client_status_callback:
>> Status update: Client node2/crmd now has status [join] (DC=true)
>> Nov 19 08:36:34 node1 ccm: [3293]: debug: quorum plugin: majority
>> Nov 19 08:36:34 node1 ccm: [3293]: debug: cluster:linux-ha,
>> member_count=2, member_quorum_votes=200
>> Nov 19 08:36:34 node1 ccm: [3293]: debug: total_node_count=2,
>> total_quorum_votes=200
>> Nov 19 08:36:34 node1 cib[3294]:   notice: crm_update_peer_state:
>> crm_update_ccm_node: Node node2[1] - state is now member (was (null))
>> Nov 19 08:36:34 node1 crmd[3298]:   notice: crm_update_peer_state:
>> crm_update_ccm_node: Node node2[1] - state is now member (was lost)
>> Nov 19 08:36:34 node1 crmd[3298]:   notice: do_state_transition: State
>> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
>> origin=peer_update_callback ]
>>
>> after the above, resources are started alright
>> Nov 19 08:36:40 node1 attrd[3297]:   notice: attrd_local_callback:
>> Sending full refresh (origin=crmd)
>> Nov 19 08:36:40 node1 attrd[3297]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: probe_complete (true)
>> Nov 19 08:36:40 node1 attrd[3297]:   notice: attrd_trigger_update:
>> Sending flush op to all hosts for: master-MYSQL:0 (100)
>> Nov 19 08:36:41 node1 pengine[3299]:   notice: unpack_config: On loss of
>> CCM Quorum: Ignore
>> Nov 19 08:36:41 node1 pengine[3299]:   notice: LogActions: Start
>> IPADDR:1#011(node2)
>>
>> Any help/pointers greatly appreciated.
>>
>> Thanks.
>>
>> --Shyam
>>
>> On Mon, Nov 23, 2015 at 12:14 PM, Shyam <shyam.kaushik at gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I need help with a timing issue that I hit when using pacemaker-1.1.10
>>> with heartbeat-3.0.5 on Ubuntu Trusty.
>>>
>>> As seen in the logs below, the issue appears to be with the membership
>>> that CRMD looks at: it wrongly moves to the INTEGRATION phase when the
>>> peer CRMD hasn't fully joined the cluster yet.
>>>
>>> We have a 2-node cluster &, occasionally, when one of the nodes comes
>>> up after a reboot it remains in S_PENDING state until
>>> heartbeat/pacemaker on that node is restarted. This happens
>>> intermittently, not always.
>>>
>>> The logs below show the problem clearly:
>>>
>>> node1 has been running & sees node2 coming up
>>>
>>> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_ha_status_callback:
>>> Status update: Node node2 now has status [init]
>>> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_ha_status_callback:
>>> Status update: Node node2 now has status [up]
>>> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Link node2:eth0 up.
>>> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
>>> node2: status init
>>> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
>>> node2: status up
>>> Nov 20 09:09:10 node1 heartbeat: [2943]: debug: get_delnodelist:
>>> delnodelist=
>>> Nov 20 09:09:10 node1 heartbeat: [2943]: info: all clients are now paused
>>> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_ha_status_callback:
>>> Status update: Node node2 now has status [active]
>>> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
>>> Status update: Client node2/crmd now has status [join] (DC=true)
>>> Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node
>>> node2: status active
>>> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
>>> Status update: Client node2/crmd now has status [leave] (DC=true)
>>> Nov 20 09:09:11 node1 heartbeat: [2943]: info: all clients are now
>>> resumed
>>>
>>>
>>> as can be seen above, the CRMD on the peer registered with heartbeat &
>>> then de-registered (not sure why)
>>> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
>>> Status update: Client node2/crmd now has status [join] (DC=true)
>>> Nov 20 09:09:10 node1 crmd[3905]:   notice: crmd_client_status_callback:
>>> Status update: Client node2/crmd now has status [leave] (DC=true)
>>>
>>>
>>> while this is the state, CRMD on node1 moves to the INTEGRATION phase
>>> & there are join requests flowing through, but presumably they cannot
>>> be handled by the peer
>>>
>>> Nov 20 09:09:13 node1 cib[3901]:   notice: crm_update_peer_state:
>>> crm_update_ccm_node: Node node2[0] - state is now member (was (null))
>>> Nov 20 09:09:13 node1 crmd[3905]:   notice: crm_update_peer_state:
>>> crm_update_ccm_node: Node node2[0] - state is now member (was lost)
>>> Nov 20 09:09:13 node1 crmd[3905]:   notice: do_state_transition: State
>>> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
>>> origin=peer_update_callback ]
>>> Nov 20 09:09:13 node1 ccm: [3900]: debug: quorum plugin: majority
>>> Nov 20 09:09:13 node1 ccm: [3900]: debug: cluster:linux-ha,
>>> member_count=2, member_quorum_votes=200
>>> Nov 20 09:09:13 node1 ccm: [3900]: debug: total_node_count=2,
>>> total_quorum_votes=200
>>> Nov 20 09:09:13 node1 crmd[3905]:   notice: crmd_client_status_callback:
>>> Status update: Client node2/crmd now has status [join] (DC=true)
>>> Nov 20 09:09:13 node1 crmd[3905]:  warning: match_down_event: No match
>>> for shutdown action on 00000432-0432-0000-2b91-000000000000
>>> Nov 20 09:09:13 node1 crmd[3905]:   notice: peer_update_callback:
>>> Stonith/shutdown of node2 not matched
>>> Nov 20 09:09:13 node1 attrd[3904]:   notice: attrd_local_callback:
>>> Sending full refresh (origin=crmd)
>>> Nov 20 09:09:13 node1 attrd[3904]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: probe_complete (true)
>>> Nov 20 09:09:13 node1 attrd[3904]:   notice: attrd_trigger_update:
>>> Sending flush op to all hosts for: master-MYSQL:0 (100)
>>> Nov 20 09:09:13 node1 pengine[3906]:   notice: unpack_config: On loss of
>>> CCM Quorum: Ignore
>>> Nov 20 09:09:13 node1 pengine[3906]:  warning: custom_action: Action
>>> IPADDR:0_monitor_0 on node2 is unrunnable (pending)
>>>
>>>
>>> As can be seen we had
>>> Nov 20 09:09:13 node1 crmd[3905]:   notice: do_state_transition: State
>>> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
>>>
>>> & only after that does the CRMD on the peer seem to be alright
>>> Nov 20 09:09:13 node1 crmd[3905]:   notice: crmd_client_status_callback:
>>> Status update: Client node2/crmd now has status [join] (DC=true)
>>>
>>> so pengine decided it cannot run any resources & we are stuck in this
>>> state, while node2 has this trace:
>>> Nov 20 09:09:13 node2 cib[2964]:   notice: cib_server_process_diff: Not
>>> applying diff 1.3.33 -> 1.3.34 (sync in progress)
>>> Nov 20 09:09:13 node2 cib[2964]:   notice: crm_update_peer_state:
>>> crm_update_ccm_node: Node node1[1] - state is now member (was (null))
>>> Nov 20 09:09:13 node2 cib[2964]:   notice: crm_update_peer_state:
>>> crm_update_ccm_node: Node node2[0] - state is now member (was (null))
>>> Nov 20 09:09:13 node2 crmd[2968]:   notice: crm_cluster_connect:
>>> Connecting to cluster infrastructure: heartbeat
>>> Nov 20 09:09:14 node2 heartbeat: [2938]: info: the send queue length
>>> from heartbeat to client crmd is set to 1024
>>> Nov 20 09:09:15 node2 crmd[2968]:   notice: crmd_client_status_callback:
>>> Status update: Client node2/crmd now has status [join] (DC=false)
>>> Nov 20 09:09:15 node2 crmd[2968]:   notice: crmd_client_status_callback:
>>> Status update: Client node2/crmd now has status [online] (DC=false)
>>> Nov 20 09:09:15 node2 crmd[2968]:   notice: crmd_client_status_callback:
>>> Status update: Client node1/crmd now has status [online] (DC=false)
>>> Nov 20 09:09:16 node2 crmd[2968]:   notice: crm_update_peer_state:
>>> crm_update_ccm_node: Node node1[1] - state is now member (was (null))
>>> Nov 20 09:09:16 node2 crmd[2968]:   notice: crm_update_peer_state:
>>> crm_update_ccm_node: Node node2[0] - state is now member (was (null))
>>> Nov 20 09:09:16 node2 crmd[2968]:   notice: do_started: The local CRM is
>>> operational
>>> Nov 20 09:09:16 node2 crmd[2968]:   notice: do_state_transition: State
>>> transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
>>> origin=do_started ]
>>> Nov 20 09:09:27 node2 crmd[2968]:  warning: do_log: FSA: Input
>>> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
>>>
>>>
>>> i.e. it moves to S_PENDING state & keeps receiving I_DC_TIMEOUT
>>>
>>> I tried using the latest heartbeat, but that doesn't seem to be the
>>> problem. Can anyone say whether this issue has already been fixed in
>>> the latest pacemaker, or suggest any other way to debug it?
>>>
>>> If I enable a higher debug level (in both heartbeat & pacemaker), the
>>> problem doesn't show up. Any help/pointers on how to proceed are
>>> greatly appreciated.
>>>
>>> Thanks!
>>>
>>> --Shyam
>>>
>>
>>
>