<div dir="ltr"><div><span style="font-size:12.8px">Hi Lars,</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">Thanks for your reply. We used pacemaker-1.1.10/heartbeat-3.0.5 as unfortunately thats what Ubuntu Trusty LTS comes up with. In fact Ubuntu xenial still has heartbeat-3.0.5</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">We will move to heartbeat-3.0.6/pacemaker-1.1.13 like you suggest. One question for you.</span></div><div><span style="font-size:12.8px"><br></span></div><div>> Also, if you want to do Pacemaker with "older" heartbeat,<br>> you have to go Pacemaker 1.1.6 or older,<br>> or what you observe here is just one (small) issue.<br>> 1.1.7 up until 1.1.13<br>> (or, 1.1.12 + linbit patches, which have then been merged into 1.1.13)<br>> do not work properly with heartbeat, due to changes in pacemaker<br>> upstream which have been incompatible with the heartbeat messaging and<br>> membership layer (especially when it comes to fencing/stonith).<span style="font-size:12.8px"><br></span></div><div><br></div><div>Are you saying that pacemaker-1.1.13 will work alright with heartbeat-3.0.6 or are there known issues that we must go back to pacemaker-1.1.6?</div><div><br></div><div>At this point it will be hard for us to shift to corosync & will be better to continue with heartbeat. So your inputs are greatly appreciated. Thank you!</div><div><br></div><div>--Shyam</div><span style="font-size:12.8px"><div><span style="font-size:12.8px"><br></span></div></span><span style="font-size:12.8px"><div><span style="font-size:12.8px"><br></span></div>> Date: Fri, 27 Nov 2015 12:58:15 +0100</span><br style="font-size:12.8px"><span style="font-size:12.8px">> From: Lars Ellenberg <</span><a href="mailto:lars.ellenberg@linbit.com" style="font-size:12.8px">lars.ellenberg@linbit.com</a><span style="font-size:12.8px">></span><br style="font-size:12.8px"><span style="font-size:12.8px">> To: </span><a href="mailto:users@clusterlabs.org" style="font-size:12.8px">users@clusterlabs.org</a><br style="font-size:12.8px"><span style="font-size:12.8px">> Subject: Re: [ClusterLabs] BUG in crmd/ccm membership handling when</span><br style="font-size:12.8px"><span style="font-size:12.8px">> crmd registers with cluster a bit late</span><br style="font-size:12.8px"><span style="font-size:12.8px">> Message-ID: <20151127115813.GF22693@soda.</span><span style="font-size:12.8px">linbit></span><br style="font-size:12.8px"><span style="font-size:12.8px">> Content-Type: text/plain; charset=iso-8859-1</span><br style="font-size:12.8px">> <br style="font-size:12.8px"><span style="font-size:12.8px">> On Fri, Nov 27, 2015 at 10:44:02AM +0530, Shyam wrote:</span><br style="font-size:12.8px"><span style="font-size:12.8px">>> Hi Folks,</span><br style="font-size:12.8px"><span style="font-size:12.8px">>></span><br style="font-size:12.8px"><span style="font-size:12.8px">>> I am using pacemaker-1.1.10 with heartbeat-3.0.5 on Ubuntu Trusty & hit</span><br style="font-size:12.8px"><span style="font-size:12.8px">>> upon a bug in CRMD/CCM. This bug exists in mainline pacemaker as well.</span><br style="font-size:12.8px">><br style="font-size:12.8px"><span style="font-size:12.8px">> You know that there is heartbeat 3.0.6 and Pacemaker 1.1.13 ?</span><br style="font-size:12.8px">><br style="font-size:12.8px"><span style="font-size:12.8px">> heartbeat older than 3.0.6 is broken in a number of ways.</span><br style="font-size:12.8px">> <br style="font-size:12.8px">> Also, if you want to do Pacemaker with "older" heartbeat,<br>> you have to go Pacemaker 1.1.6 or older,<br>> or what you observe here is just one (small) issue.<br>> 1.1.7 up until 1.1.13<br>> (or, 1.1.12 + linbit patches, which have then been merged into 1.1.13)<br>> do not work properly with heartbeat, due to changes in pacemaker<br>> upstream which have been incompatible with the heartbeat messaging and<br>> membership layer (especially when it comes to fencing/stonith).><br style="font-size:12.8px">><br style="font-size:12.8px"><span style="font-size:12.8px">> --</span><br style="font-size:12.8px"><span style="font-size:12.8px">> : Lars Ellenberg</span><br style="font-size:12.8px"><span style="font-size:12.8px">> : </span><a href="http://www.linbit.com/" rel="noreferrer" target="_blank" style="font-size:12.8px">http://www.LINBIT.com</a><span style="font-size:12.8px"> | Your Way to High Availability</span><br style="font-size:12.8px"><span style="font-size:12.8px">> : DRBD, Linux-HA and Pacemaker support and consulting</span><br style="font-size:12.8px">> <br style="font-size:12.8px"><span style="font-size:12.8px">> DRBD? and LINBIT? are registered trademarks of LINBIT, Austria.</span><br style="font-size:12.8px"></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Nov 27, 2015 at 10:44 AM, Shyam <span dir="ltr"><<a href="mailto:shyam.kaushik@gmail.com" target="_blank">shyam.kaushik@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Folks,<div><br></div><div><div style="font-size:12.8px">I am using pacemaker-1.1.10 with heartbeat-3.0.5 on Ubuntu Trusty & hit upon a bug in CRMD/CCM. This bug exists in mainline pacemaker as well.</div></div><div style="font-size:12.8px"><br></div><div><span style="font-size:12.8px">The bug can be easily exposed in a 2-node cluster by adding a sleep(5) within do_ha_control() just before it does crm_cluster_connect(cluster) & reboot one of the node. This node will remain in pending state & never join the cluster.</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">The bug is this</span></div><div><span style="font-size:12.8px"># cib/ccm connects into heartbeat cluster earlier & also crmd connects into cib/ccm on this node. Because of this peer gets membership events & considers crmd on the incoming node as online, though crmd on it is not connected to heartbeat cluster. So the node remaining as DC when it sends join-offers they are lost & election doesnt progress. Later when the incoming node actually connects to heartbeat cluster it generates node events that are wrongly considered at CRMD as offline events & we remove the membership about the incoming node. After this point all messages from peer are completely ignored. </span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">Can you pls suggest what is the best way to fix this problem? Would making crmd connecting to heartbeat first & then to rest of cib/ccm help?</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">Here are detailed logs on the problem</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">node1 is coming up & starts heartbeat</span></div><div><span style="font-size:12.8px">Nov 26 19:07:08 node1 heartbeat: [2937]: info: Configuration validated. Starting heartbeat 3.0.5</span><br></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">as part of this the running node2/DC gets link up event & also gets crm event that peer heartbeat is online</span></div><div><div><span style="font-size:12.8px"><br></span></div><div><div>Nov 26 19:07:08 node2 heartbeat: [3393]: info: Link node1:eth0 up.</div><div>Nov 26 19:07:08 node2 heartbeat: [3393]: info: Status update for node node1: status init</div><div><br></div><div><div>Nov 26 19:07:08 node2 crmd[3523]: notice: crmd_ha_status_callback: Status update: Node node1 now has status [init]</div><div>Nov 26 19:07:08 node2 crmd[3523]: info: crm_update_peer_proc: crmd_ha_status_callback: Node node1[0] - heartbeat is now online</div><div>Nov 26 19:07:08 node2 crmd[3523]: crit: peer_update_callback: Client node1/peer now has status [offline] (DC=true)</div><div>Nov 26 19:07:08 node2 crmd[3523]: crit: peer_update_callback: No change 1 1000001 4000200</div><div>Nov 26 19:07:08 node2 crmd[3523]: notice: crmd_ha_status_callback: Status update: Node node1 now has status [up]</div></div><div><br></div><div><br></div><div><br></div><div>while crmd/cib on node1 starts up + crmd connects into cib</div><div><div>Nov 26 19:07:09 node1 crmd[2969]: notice: main: CRM Git Version: 42f2063</div><div>Nov 26 19:07:09 node1 crmd[2969]: debug: crmd_init: Starting crmd</div><div><br></div><div>Nov 26 19:07:09 node1 cib[2965]: info: crm_get_peer: Node 0 has uuid 00000432-0432-0000-2b91-000000000000</div><div>Nov 26 19:07:09 node1 cib[2965]: info: register_heartbeat_conn: Hostname: node1</div><div>Nov 26 19:07:09 node1 cib[2965]: info: register_heartbeat_conn: UUID: 00000432-0432-0000-2b91-000000000000</div><div>Nov 26 19:07:09 node1 cib[2965]: info: ccm_connect: Registering with CCM...</div><div>Nov 26 19:07:09 node1 cib[2965]: debug: ccm_connect: CCM Activation passed... all set to go!</div><div>Nov 26 19:07:09 node1 cib[2965]: info: cib_init: Requesting the list of configured nodes</div><div><br></div><div>Nov 26 19:07:10 node1 crmd[2969]: debug: cib_native_signon_raw: Connection to CIB successful</div><div><br></div><div>Nov 26 19:07:10 node1 cib[2965]: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/2, version=1.3.0)</div><div>Nov 26 19:07:10 node1 cib[2965]: info: crm_client_new: Connecting 0x7fcbfab10480 for uid=0 gid=0 pid=2967 id=276f6f8e-14cc-426e-bf63-b79674f1bfaa</div><div>Nov 26 19:07:10 node1 cib[2965]: debug: handle_new_connection: IPC credentials authenticated (2965-2967-12)</div><div>Nov 26 19:07:10 node1 cib[2965]: debug: qb_ipcs_shm_connect: connecting to client [2967]</div><div>Nov 26 19:07:10 node1 cib[2965]: debug: qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096</div></div></div></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">cib on DC gets notified/marked online</span></div><div><div style="font-size:12.8px">Nov 26 19:07:09 node2 cib[3519]: info: cib_client_status_callback: Status update: Client node1/cib now has status [join]</div><div style="font-size:12.8px">Nov 26 19:07:09 node2 crmd[3523]: notice: crmd_ha_status_callback: Status update: Node node1 now has status [active]</div><div style="font-size:12.8px">Nov 26 19:07:09 node2 crmd[3523]: notice: crmd_client_status_callback: Status update: Client node1/crmd now has status [join] (DC=true)</div><div style="font-size:12.8px">Nov 26 19:07:09 node2 cib[3519]: info: crm_get_peer: Created entry 5886699f-30f8-4b7d-9d38-4e661d99b2d9/0x7fa985e33b70 for node node1/0 (2 total)</div><div style="font-size:12.8px">Nov 26 19:07:09 node2 crmd[3523]: notice: crmd_client_status_callback: Status update: Client node1/crmd now has status [leave] (DC=true)</div><div style="font-size:12.8px">Nov 26 19:07:10 node2 cib[3519]: info: crm_get_peer: Node 0 has uuid 00000432-0432-0000-2b91-000000000000</div><div style="font-size:12.8px">Nov 26 19:07:10 node2 cib[3519]: info: crm_update_peer_proc: cib_client_status_callback: Node node1[0] - cib is now online</div><div style="font-size:12.8px">Nov 26 19:07:10 node2 cib[3519]: info: cib_client_status_callback: Status update: Client node1/cib now has status [leave]</div><div style="font-size:12.8px">Nov 26 19:07:10 node2 cib[3519]: info: crm_update_peer_proc: cib_client_status_callback: Node node1[0] - cib is now offline</div><div style="font-size:12.8px">Nov 26 19:07:10 node2 cib[3519]: info: cib_client_status_callback: Status update: Client node1/cib now has status [join]</div><div style="font-size:12.8px">Nov 26 19:07:10 node2 cib[3519]: info: crm_update_peer_proc: cib_client_status_callback: Node node1[0] - cib is now online</div><div style="font-size:12.8px">Nov 26 19:07:10 node2 cib[3519]: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/58, version=1.3.31)</div><div style="font-size:12.8px">Nov 26 19:07:10 node2 cib[3519]: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/59, version=1.3.31)</div><div style="font-size:12.8px">Nov 26 19:07:10 node2 cib[3519]: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/60, version=1.3.31)</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">as part of CCM processing, the DC considers the peer node for membership & marks its crmd online</div><div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 ccm: [3518]: debug: quorum plugin: majority</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 ccm: [3518]: debug: cluster:linux-ha, member_count=2, member_quorum_votes=200</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 ccm: [3518]: debug: total_node_count=2, total_quorum_votes=200</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=2)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 cib[3519]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=2)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: info: ccm_event_detail: NEW MEMBERSHIP: trans=2, nodes=2, new=1, lost=0 n_idx=0, new_idx=2, old_idx=4</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: info: ccm_event_detail: #011CURRENT: node2 [nodeid=1, born=1]</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 cib[3519]: notice: crm_update_peer_state: crm_update_ccm_node: Node node1[0] - state is now member (was (null))</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: info: ccm_event_detail: #011CURRENT: node1 [nodeid=0, born=2]</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: info: ccm_event_detail: #011NEW: node1 [nodeid=0, born=2]</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: notice: crm_update_peer_state: crm_update_ccm_node: Node node1[0] - state is now member (was lost)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 cib[3519]: info: crm_update_peer_proc: crm_update_ccm_node: Node node1[0] - heartbeat is now online</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 cib[3519]: info: crm_update_peer_proc: crm_update_ccm_node: Node node1[0] - crmd is now online</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: node1 is now member (was lost)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: Alive=0, appear=1, down=(nil)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: Other (nil)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: info: crm_update_peer_proc: crm_update_ccm_node: Node node1[0] - crmd is now online</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 cib[3519]: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/crmd/61, version=1.3.31)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: Client node1/peer now has status [online] (DC=true)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: Alive=1, appear=1, down=(nil)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: peer_update_callback: Other (nil)</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">subsequently this triggers the join processing. but since the peer node crmd has not yet connected to heartbeat cluster it cannot get the join-offers that is being sent by the DC</span></div><div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: debug: post_cache_update: Updated cache after membership event 2.</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: debug: post_cache_update: post_cache_update added action A_ELECTION_CHECK to the FSA</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: s_crmd_fsa: Processing I_NODE_JOIN: [ state=S_IDLE cause=C_FSA_INTERNAL origin=peer_update_callback ]</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: debug: crm_timer_start: Started Integration Timer (I_INTEGRATED:180000ms), src=95</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: debug: do_election_check: Ignore election check: we not in an election</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: do_dc_join_offer_one: An unknown node joined - (re-)offer to any unconfirmed nodes</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: join_make_offer: about to make an offer to node2</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: join_make_offer: Making join offers based on membership 2</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: join_make_offer: Skipping node2: already known 4</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: join_make_offer: about to make an offer to node1</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: join_make_offer: join-1: Sending offer to node1</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: info: crm_update_peer_join: join_make_offer: Node node1[0] - join-1 phase 0 -> 1</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: debug: check_join_state: Invoked by do_dc_join_offer_one in state: S_INTEGRATION</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: debug: do_te_invoke: Halting the transition: inactive</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: info: abort_transition_graph: do_te_invoke:158 - Triggered transition abort (complete=1) : Peer Halt</div><div style="font-size:12.8px">Nov 26 19:07:12 node2 crmd[3523]: crit: s_crmd_fsa: Processing I_PE_CALC: [ state=S_INTEGRATION cause=C_FSA_INTERNAL origin=abort_transition_graph ]</div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px"><br></div><div style="font-size:12.8px">later crmd on the node that is coming up connects to heartbeat </div><div><div><span style="font-size:12.8px">Nov 26 19:07:15 node1 crmd[2969]: notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node1 crmd[2969]: debug: register_heartbeat_conn: Signing in with Heartbeat</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node1 crmd[2969]: info: crm_get_peer: Created entry 75b26770-ac15-461c-889d-190e6c8139ac/0x7f2fd194f870 for node node1/0 (1 total)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node1 crmd[2969]: crit: peer_update_callback: node1 is now (null)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node1 crmd[2969]: info: crm_get_peer: Node 0 has uuid 00000432-0432-0000-2b91-000000000000</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node1 crmd[2969]: info: register_heartbeat_conn: Hostname: node1</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node1 crmd[2969]: info: register_heartbeat_conn: UUID: 00000432-0432-0000-2b91-000000000000</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">& at this point heartbeat delivers a notification into crmd as part of which crmd wrongly moves the status of the incoming node as offline/removing out of membership</span></div><div><div style="font-size:12.8px">Nov 26 19:07:15 node2 crmd[3523]: notice: crmd_client_status_callback: Status update: Client node1/crmd now has status [join] (DC=true)</div><div style="font-size:12.8px">Nov 26 19:07:15 node2 crmd[3523]: info: crm_update_peer_proc: crmd_client_status_callback: Node node1[0] - crmd is now join</div><div style="font-size:12.8px">Nov 26 19:07:15 node2 crmd[3523]: crit: peer_update_callback: Client node1/peer now has status [offline] (DC=true)</div><div style="font-size:12.8px"><br></div><div><div><span style="font-size:12.8px">Nov 26 19:07:15 node2 crmd[3523]: warning: match_down_event: No match for shutdown action on 00000432-0432-0000-2b91-000000000000</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node2 crmd[3523]: crit: peer_update_callback: Alive=0, appear=0, down=(nil)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node2 crmd[3523]: notice: peer_update_callback: Stonith/shutdown of node1 not matched</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node2 crmd[3523]: info: crm_update_peer_join: peer_update_callback: Node node1[0] - join-1 phase 1 -> 0</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node2 crmd[3523]: debug: check_join_state: Invoked by peer_update_callback in state: S_INTEGRATION</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node2 crmd[3523]: debug: check_join_state: join-1: Integration of 0 peers complete: peer_update_callback</span></div><div><span style="font-size:12.8px">Nov 26 19:07:15 node2 crmd[3523]: info: abort_transition_graph: peer_update_callback:214 - Triggered transition abort (complete=1) : Node failure</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">so bottom-line due to a timing issue when crmd connects into hearbeat cluster we wrongly marked in the DC that the incoming node is not a member. this causes any message from incoming node from this point onwards to be ignored like</span></div><div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">Nov 26 19:07:18 node2 crmd[3523]: debug: crmd_ha_msg_callback: Ignoring HA message (op=noop) from node1: not in our membership list (size=1)</span></div><div><span style="font-size:12.8px">Nov 26 19:07:19 node2 crmd[3523]: warning: crmd_ha_msg_callback: Ignoring HA message (op=join_announce) from node1: not in our membership list (size=1)</span></div><div><span style="font-size:12.8px"><br></span></div><div><span style="font-size:12.8px">--Shyam</span></div></div></div></div></div></div></div></div><div><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Nov 23, 2015 at 9:30 PM, Shyam <span dir="ltr"><<a href="mailto:shyam.kaushik@gmail.com" target="_blank">shyam.kaushik@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">One note on this.<div><br></div><div>This problem doesnt happen if</div><div><br></div><div><div>Nov 19 08:36:30 node1 crmd[3298]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [join] (DC=true)</div><div>Nov 19 08:36:30 node1 crmd[3298]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [leave] (DC=true)</div><div>Nov 19 08:36:31 node1 crmd[3298]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [join] (DC=true)</div></div><div><br></div><div>join happens before </div><div>Nov 19 08:36:34 node1 crmd[3298]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]<br></div><div><br></div><div>I am not sure why crmd join/leave & then join happens. With previous version of heartbeat/pacemaker this doesnt happen.</div><div><br></div><div>here are more logs when the problem doesnt happen</div><div><br></div><div><div>Nov 19 08:36:29 node1 heartbeat: [3143]: info: Link node2:eth0 up.</div><div>Nov 19 08:36:29 node1 heartbeat: [3143]: info: Status update for node node2: status init</div><div>Nov 19 08:36:29 node1 crmd[3298]: notice: crmd_ha_status_callback: Status update: Node node2 now has status [init]</div><div>Nov 19 08:36:29 node1 crmd[3298]: notice: crmd_ha_status_callback: Status update: Node node2 now has status [up]</div><div>Nov 19 08:36:30 node1 heartbeat: [3143]: debug: get_delnodelist: delnodelist=</div><div>Nov 19 08:36:30 node1 heartbeat: [3143]: info: Status update for node node2: status active</div><div>Nov 19 08:36:30 node1 crmd[3298]: notice: crmd_ha_status_callback: Status update: Node node2 now has status [active]</div><div>Nov 19 08:36:30 node1 crmd[3298]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [join] (DC=true)</div><div>Nov 19 08:36:30 node1 crmd[3298]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [leave] (DC=true)</div><div>Nov 19 08:36:31 node1 crmd[3298]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [join] (DC=true)</div><div>Nov 19 08:36:34 node1 ccm: [3293]: debug: quorum plugin: majority</div><div>Nov 19 08:36:34 node1 ccm: [3293]: debug: cluster:linux-ha, member_count=2, member_quorum_votes=200</div><div>Nov 19 08:36:34 node1 ccm: [3293]: debug: total_node_count=2, total_quorum_votes=200</div><div>Nov 19 08:36:34 node1 cib[3294]: notice: crm_update_peer_state: crm_update_ccm_node: Node node2[1] - state is now member (was (null))</div><div>Nov 19 08:36:34 node1 crmd[3298]: notice: crm_update_peer_state: crm_update_ccm_node: Node node2[1] - state is now member (was lost)</div><div>Nov 19 08:36:34 node1 crmd[3298]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]</div></div><div><br></div><div>after the above, resources are started alright</div><div><div>Nov 19 08:36:40 node1 attrd[3297]: notice: attrd_local_callback: Sending full refresh (origin=crmd)</div><div>Nov 19 08:36:40 node1 attrd[3297]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)</div><div>Nov 19 08:36:40 node1 attrd[3297]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-MYSQL:0 (100)</div><div>Nov 19 08:36:41 node1 pengine[3299]: notice: unpack_config: On loss of CCM Quorum: Ignore</div><div>Nov 19 08:36:41 node1 pengine[3299]: notice: LogActions: Start IPADDR:1#011(node2)</div></div><div><br></div><div>Any help/pointers greatly apprecited.</div><div><br></div><div>Thanks.</div><span><font color="#888888"><div><br></div><div>--Shyam</div></font></span></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Nov 23, 2015 at 12:14 PM, Shyam <span dir="ltr"><<a href="mailto:shyam.kaushik@gmail.com" target="_blank">shyam.kaushik@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">Hi All,<div><br></div><div>Need a help w.r.t. a timing issue that I hit when using pacemaker-1.1.10 with heartbeat-3.0.5 on Ubuntu Trusty.</div><div><br></div><div><div>As seen in logs below, issue appears to be w.r.t. membership that CRMD looks at that it wrongly moves to INTEGRATION phase when the peer CRMD hasnt joined the cluster fully yet.</div></div><div><br></div><div>We have a 2 node cluster & occasionally when one of the node comes up after reboot it keeps remaining in S_PENDING state/until heartbeat/pacemaker on that node is restarted. This happens periodically & not always. </div><div><br></div><div>Below logs shows the problem clearly<br></div><div><br></div><div>ndoe1 has been running & sees node2 coming up</div><div><br></div><div><div>Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_ha_status_callback: Status update: Node node2 now has status [init]</div><div>Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_ha_status_callback: Status update: Node node2 now has status [up]</div><div>Nov 20 09:09:10 node1 heartbeat: [2943]: info: Link node2:eth0 up.</div><div>Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node node2: status init</div><div>Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node node2: status up</div><div>Nov 20 09:09:10 node1 heartbeat: [2943]: debug: get_delnodelist: delnodelist=</div><div>Nov 20 09:09:10 node1 heartbeat: [2943]: info: all clients are now paused</div><div>Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_ha_status_callback: Status update: Node node2 now has status [active]</div><div>Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [join] (DC=true)</div><div>Nov 20 09:09:10 node1 heartbeat: [2943]: info: Status update for node node2: status active</div><div>Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [leave] (DC=true)</div><div>Nov 20 09:09:11 node1 heartbeat: [2943]: info: all clients are now resumed</div></div><div><br></div><div><br></div><div>as can be seen above, w.r.t. CRMD on peer, it registered at heartbeat & de-registered (not sure why)</div><div>Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [join] (DC=true)<br></div><div>Nov 20 09:09:10 node1 crmd[3905]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [leave] (DC=true)<br></div><div><br></div><div><br></div><div>while this is the state CRMD on node1 moves to INTEGRATION phase & there are join requests flowing through, but presume I cannot be handled by peer</div><div><br></div><div><div>Nov 20 09:09:13 node1 cib[3901]: notice: crm_update_peer_state: crm_update_ccm_node: Node node2[0] - state is now member (was (null))</div><div>Nov 20 09:09:13 node1 crmd[3905]: notice: crm_update_peer_state: crm_update_ccm_node: Node node2[0] - state is now member (was lost)</div><div>Nov 20 09:09:13 node1 crmd[3905]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]</div><div>Nov 20 09:09:13 node1 ccm: [3900]: debug: quorum plugin: majority</div><div>Nov 20 09:09:13 node1 ccm: [3900]: debug: cluster:linux-ha, member_count=2, member_quorum_votes=200</div><div>Nov 20 09:09:13 node1 ccm: [3900]: debug: total_node_count=2, total_quorum_votes=200</div><div>Nov 20 09:09:13 node1 crmd[3905]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [join] (DC=true)</div><div>Nov 20 09:09:13 node1 crmd[3905]: warning: match_down_event: No match for shutdown action on 00000432-0432-0000-2b91-000000000000</div><div>Nov 20 09:09:13 node1 crmd[3905]: notice: peer_update_callback: Stonith/shutdown of node2 not matched</div><div>Nov 20 09:09:13 node1 attrd[3904]: notice: attrd_local_callback: Sending full refresh (origin=crmd)</div><div>Nov 20 09:09:13 node1 attrd[3904]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)</div><div>Nov 20 09:09:13 node1 attrd[3904]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-MYSQL:0 (100)</div><div>Nov 20 09:09:13 node1 pengine[3906]: notice: unpack_config: On loss of CCM Quorum: Ignore</div><div>Nov 20 09:09:13 node1 pengine[3906]: warning: custom_action: Action IPADDR:0_monitor_0 on node2 is unrunnable (pending)</div></div><div><br></div><div><br></div><div>As can be seen we had</div><div>Nov 20 09:09:13 node1 crmd[3905]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL <br></div><div><br></div><div>& only then CRMD on peer seems to be alright</div><div><div>Nov 20 09:09:13 node1 crmd[3905]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [join] (DC=true)</div></div><div><br></div><div>so pengine decided not being able to run any resources & we are stuck in this state. While node2 has this trace</div><div><div>Nov 20 09:09:13 node2 cib[2964]: notice: cib_server_process_diff: Not applying diff 1.3.33 -> 1.3.34 (sync in progress)</div><div>Nov 20 09:09:13 node2 cib[2964]: notice: crm_update_peer_state: crm_update_ccm_node: Node node1[1] - state is now member (was (null))</div><div>Nov 20 09:09:13 node2 cib[2964]: notice: crm_update_peer_state: crm_update_ccm_node: Node node2[0] - state is now member (was (null))</div><div>Nov 20 09:09:13 node2 crmd[2968]: notice: crm_cluster_connect: Connecting to cluster infrastructure: heartbeat</div><div>Nov 20 09:09:14 node2 heartbeat: [2938]: info: the send queue length from heartbeat to client crmd is set to 1024</div><div>Nov 20 09:09:15 node2 crmd[2968]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [join] (DC=false)</div><div>Nov 20 09:09:15 node2 crmd[2968]: notice: crmd_client_status_callback: Status update: Client node2/crmd now has status [online] (DC=false)</div><div>Nov 20 09:09:15 node2 crmd[2968]: notice: crmd_client_status_callback: Status update: Client node1/crmd now has status [online] (DC=false)</div><div>Nov 20 09:09:16 node2 crmd[2968]: notice: crm_update_peer_state: crm_update_ccm_node: Node node1[1] - state is now member (was (null))</div><div>Nov 20 09:09:16 node2 crmd[2968]: notice: crm_update_peer_state: crm_update_ccm_node: Node node2[0] - state is now member (was (null))</div><div>Nov 20 09:09:16 node2 crmd[2968]: notice: do_started: The local CRM is operational</div><div>Nov 20 09:09:16 node2 crmd[2968]: notice: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]</div><div>Nov 20 09:09:27 node2 crmd[2968]: warning: do_log: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING</div></div><div><br></div><div><br></div><div>i.e. it moves to S_PENDING state & keeps receiving I_DC_TIMEOUT</div><div><br></div><div>I tried using latest heartbeat, but that doesnt seem to be the problem. Can anyone suggest if this issue has already been fixed in latest pacemaker or any other suggestions how to debug this issue?</div><div><br></div><div>If I enable higher debug level (both in heartbeat/pacemaker), this problem doesnt show up. Any help/pointers on how to go forward is greatly appreciated.</div><div><br></div><div>Thanks!</div><span><font color="#888888"><div><br></div><div>--Shyam</div></font></span></div>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div></div></div>
</blockquote></div><br></div>