<div dir="ltr"><div class="gmail_quote"><div dir="ltr"><div class="gmail_extra">Hi all,</div><div class="gmail_extra"><br></div><div class="gmail_extra">We were previously using Pacemaker 1.1.6 on Ubuntu Precise. We are moving to Ubuntu Trusty, which ships Pacemaker 1.1.10.</div><div class="gmail_extra"><br></div><div class="gmail_extra">We have our own STONITH resource, and it is broken after the Pacemaker upgrade. The problem I am hitting is not new; it is exactly what was reported in <a href="http://clusterlabs.org/pipermail/users/2015-June/000686.html" target="_blank">http://clusterlabs.org/pipermail/users/2015-June/000686.html</a></div><div class="gmail_extra"><br></div><div class="gmail_extra">Basically, the Heartbeat + Pacemaker stack suffers from this problem (even with the GitHub mainline code), and since a lot of people are using Corosync, I think the problem has gone largely unnoticed.</div><div class="gmail_extra"><br></div><div class="gmail_extra">The basic problem is this. Between Pacemaker 1.1.6 and 1.1.10, the following change was made:</div><div class="gmail_extra"><br></div><div class="gmail_extra">fencing/remote.c<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">static void remote_op_done(remote_fencing_op_t * op, xmlNode * data, int rc, int dup)</div><div class="gmail_extra"><br></div><div class="gmail_extra"><div class="gmail_extra"> if (dup == FALSE && safe_str_neq(subt, "broadcast")) {</div><div class="gmail_extra"> /* Defer notification until the bcast message arrives */</div><div class="gmail_extra"> bcast_result_to_peers(op, rc);<br></div></div><div class="gmail_extra"> goto remote_op_done_cleanup;<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">remote_op_done() previously sent a direct reply. It was changed to broadcast the result instead, and this broadcast is supposed to come back to the same fencing daemon, which then finishes the operation.
However, the broadcast sent by stonithd never makes it back to stonithd itself. As a result, crmd eventually times out (stonith_async_timeout_handler() is triggered) and keeps retrying STONITH, so STONITH is essentially broken.</div><div class="gmail_extra"><br></div><div class="gmail_extra">I tried setting up bcast in /etc/ha.d/ha.cf, but it doesn't help. To confirm the issue, I removed the bcast_result_to_peers() path shown above, and STONITH then works perfectly fine. Any pointers or help on how to resolve this issue the right way would be appreciated.</div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">From the logs:</div><div class="gmail_extra"><br></div><div class="gmail_extra">crmd issues a STONITH reboot to stonith-ng</div><div class="gmail_extra"><br></div><div class="gmail_extra"><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG: Dumping message with 19 fields</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[0] : [__name__=stonith_command]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[1] : [__name__=stonith_command]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[3] : [st_async_id=617dce02-057f-42b8-9df2-33bc4e64780d]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[4] : [st_op=st_fence]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[5] : [st_callid=2]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[6] : [st_callopt=0]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[7] : [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[8] : 
[st_target=node1]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[9] : [st_device_action=reboot]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[10] : [st_origin=node0]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[11] : [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[12] : [st_clientname=crmd.14483]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[13] : [st_timeout=90]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[14] : [st_mode=smart]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[15] : [dest=node0]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[16] : [oseq=2]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[17] : [from_id=stonith-ng]</div><div class="gmail_extra">Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[18] : [to_id=stonith-ng]</div><div><br></div></div><div class="gmail_extra"><br></div><div class="gmail_extra">stonith-ng successfully finishes the STONITH</div><div class="gmail_extra"><div class="gmail_extra">Oct 14 14:57:03 node0 stonith-ng[14481]: notice: log_operation: Operation 'reboot' [14657] (call 2 from crmd.14483) for host 'node1' with device 'Z-FENCE:0' returned: 0 (OK)</div><div class="gmail_extra">Oct 14 14:57:03 node0 stonith-ng[14481]: debug: stonith_send_async_reply: Directed reply to node0</div><div class="gmail_extra">Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound <st-reply st_origin="node0" t="stonith-ng" st_op="st_fence" st_device_id="Z-FENCE:0" st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" st_clientname="crmd.14483" st_target="node1" st_device_action="reboot" st_callid="2" 
st_callopt="0" st_rc="0" st_output="Performing: stonith -t external/zstonith -T reset node1\nsuccess: node1 0\n"/></div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">& stonith-ng sends a self message for st-reply</div><div class="gmail_extra"><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message with 18 fields</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] : [__name__=st-reply]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [st_origin=node0]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : [st_op=st_fence]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : [st_device_id=Z-FENCE:0]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] : [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] : [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] : [st_clientname=crmd.14483]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[8] : [st_target=node1]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[9] : [st_device_action=reboot]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[10] : [st_callid=2]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[11] : [st_callopt=0]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[12] : [st_rc=0]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[13] : [st_output=Performing: stonith -t external/zstonith 
-T reset node1#012success: node1 0#012]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[14] : [dest=node0]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[15] : [oseq=3]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[16] : [from_id=stonith-ng]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[17] : [to_id=stonith-ng]</div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">As part of this, stonith-ng does a broadcast</div><div class="gmail_extra"><div class="gmail_extra">Oct 14 14:57:03 node0 stonith-ng[14481]: debug: stonith_command: Processing st_fence reply 0 from node0 ( 0)</div><div class="gmail_extra">Oct 14 14:57:03 node0 stonith-ng[14481]: debug: remote_op_done: Broadcasting result</div><div class="gmail_extra">Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound <st-reply t="st_notify" subt="broadcast" st_op="st_notify" count="1"></div><div class="gmail_extra">Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound <st_calldata></div><div class="gmail_extra">Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound <st_notify_fence state="2" st_rc="0" st_target="node1" st_device_action="reboot" st_delegate="node0" st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0" st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" st_clientname="crmd.14483"/></div><div class="gmail_extra">Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound </st_calldata></div><div class="gmail_extra">Oct 14 14:57:03 node0 stonith-ng[14481]: debug: send_ha_message: outbound </st-reply></div><div class="gmail_extra"><br></div><div class="gmail_extra"><br></div><div class="gmail_extra">which arrives at heartbeat</div><div class="gmail_extra"><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: 
Dumping message with 8 fields</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] : [__name__=st-reply]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [t=st_notify]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [subt=broadcast]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : [st_op=st_notify]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : [count=1]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] : [st_calldata=<st_calldata><st_notify_fence state="2" st_rc="0" st_target="node1" st_device_action="reboot" st_delegate="node0" st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0" st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e" st_clientname="crmd.14483"/></st_calldata>]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] : [from_id=stonith-ng]</div><div class="gmail_extra">Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] : [to_id=stonith-ng]</div><div class="gmail_extra"><br></div><div class="gmail_extra">but from then on nothing really happens.</div><div class="gmail_extra"><br></div><div class="gmail_extra">Eventually crmd gives up on a timeout:</div><div class="gmail_extra">Oct 14 14:59:48 node0 crmd[14483]: error: stonith_async_timeout_handler: Async call 2 timed out after 168000ms<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Thanks.</div><span class="HOEnZb"><font color="#888888"><div class="gmail_extra"><br></div><div class="gmail_extra">--Shyam</div></font></span></div></div></div></div></div>
</div><br></div>