<div dir="ltr"><div class="gmail_extra">Hello,</div><div class="gmail_extra"><br></div><div class="gmail_extra"><div class="gmail_extra">On our two-node EL7 cluster (pacemaker: 1.1.15-11.el7_3.4; corosync: 2.4.0-4; libqb: 1.0-1),</div><div class="gmail_extra">it looks like successful STONITH operations are not communicated from stonith-ng back to the initiator (in this case, crmd) until the STONITHed node is removed from the cluster when</div><div class="gmail_extra">Corosync notices that it's gone (i.e., after the token timeout).</div><div class="gmail_extra"><br></div><div class="gmail_extra">In trace debug logs, I see the STONITH reply sent via the cpg_mcast_joined (libqb) function in crm_cs_flush (stonith_send_async_reply->send_cluster_text->send_cluster_text->send_cpg_iov->crm_cs_flush->cpg_mcast_joined)</div><div class="gmail_extra"><br></div><div class="gmail_extra">Mar 13 07:18:22 [6466] bug0 stonith-ng: (  commands.c:1891  )   trace: stonith_send_async_reply:        Reply   <st-reply st_origin="bug1" t="stonith-ng" st_op="st_fence" st_device_id="ustonith:0" st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0" st_clientid="823e92da-8470-491a-b662-215526cced22" st_clientname="crmd.3973" st_target="bug1" st_device_action="st_fence" st_callid="3" st_callopt="0" st_rc="0" st_output="Chassis Power Control: Reset\nChassis Power Control: Down/Off\nChassis Power Control: Down/Off\nC</div><div class="gmail_extra">Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:636   )   trace: send_cluster_text:       Queueing CPG message 9 to all (1041 bytes, 449 bytes payload): <st-reply st_origin="bug1" t="stonith-ng" st_op="st_notify" st_device_id="ustonith:0" st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0" st_clientid="823e92da-8470-491a-b662-215526cced22" st_clientna</div><div class="gmail_extra">Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:207   )   trace: send_cpg_iov:    Queueing CPG message 9 (1041 bytes)</div><div class="gmail_extra">Mar 13 07:18:22 
[6466] bug0 stonith-ng: (       cpg.c:170   )   trace: crm_cs_flush:    CPG message sent, size=1041</div><div class="gmail_extra">Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:185   )   trace: crm_cs_flush:    Sent 1 CPG messages  (0 remaining, last=9): OK (1)</div><div class="gmail_extra"><br></div><div class="gmail_extra">But I see no further action from stonith-ng until minutes later; specifically, I don't see remote_op_done run, so the dead node is still an 'online (unclean)' member of the array and no failover can take place.</div><div class="gmail_extra"><br></div><div class="gmail_extra">When the token expires (yes, we use a very long token), I see the following:</div><div class="gmail_extra"><br></div><div class="gmail_extra">Mar 13 07:22:11 [6466] bug0 stonith-ng: (membership.c:1018  )  notice: crm_update_peer_state_iter:      Node bug1 state is now lost | nodeid=2 previous=member source=crm_update_peer_proc</div><div class="gmail_extra">Mar 13 07:22:11 [6466] bug0 stonith-ng: (      main.c:1275  )   debug: st_peer_update_callback: Broadcasting our uname because of node 2</div><div class="gmail_extra">Mar 13 07:22:11 [6466] bug0 stonith-ng: (       cpg.c:636   )   trace: send_cluster_text:       Queueing CPG message 10 to all (666 bytes, 74 bytes payload): <stonith_command __name__="stonith_command" t="stonith-ng" st_op="poke"/></div><div class="gmail_extra">...</div><div class="gmail_extra">Mar 13 07:22:11 [6466] bug0 stonith-ng: (  commands.c:2582  )   debug: stonith_command: Processing st_notify reply 0 from bug0 (               0)</div><div class="gmail_extra">Mar 13 07:22:11 [6466] bug0 stonith-ng: (    remote.c:1945  )   debug: process_remote_stonith_exec:     Marking call to poweroff for bug1 on behalf of crmd.3973@39b1f1e0-b76f-4d25-bd15-77b956c914a0.bug1: OK (0)</div><div class="gmail_extra"><br></div><div class="gmail_extra">and the STONITH command is finally communicated back to crmd as complete and failover commences.</div><div 
class="gmail_extra"><br></div><div class="gmail_extra">Is this delay a feature of the cpg_mcast_joined function?  If I understand correctly (unlikely), it looks like cpg_mcast_joined is not completing because one of the nodes in the group is missing, but I haven't looked at that code closely yet.  Is it advisable to have stonith-ng broadcast a membership change when it successfully fences a node?</div><div class="gmail_extra"><br></div><div class="gmail_extra">Attaching logs with PCMK_debug=stonith-ng and PCMK_trace_functions=stonith_send_async_reply,send_cluster_text,send_cpg_iov,crm_cs_flush</div><div class="gmail_extra"><br></div><div class="gmail_extra">Thanks in advance,</div><div class="gmail_extra">Chris</div></div></div>