[ClusterLabs] heartbeat/pacemaker: stonith-ng does a broadcast that doesnt reach it back resulting stonith_async_timeout_handler() called

Dejan Muhamedagic dejanmm at fastmail.fm
Thu Oct 15 07:51:40 UTC 2015


Hi,

On Wed, Oct 14, 2015 at 10:31:40PM +0530, Shyam wrote:
> Hi all,
> 
> We were previously using Pacemaker 1.1.6 on Ubuntu Precise. We are shifting
> to Ubuntu Trusty & it has Pacemaker 1.1.10.
> 
> We have our own STONITH resource & with the pacemaker upgrade this is
> broken. The problem that I am hitting is not something new & is exactly
> that was reported in
> http://clusterlabs.org/pipermail/users/2015-June/000686.html
> 
> Basically Hearbeat + Pacemaker stack suffers this problem (even with the
> github mainline code) & since a lot of people are using corosync I think
> the problem is not felt with.
> 
> The basic problem is this. Between Pacemaker 1.1.6 to 1.1.10, the following
> change was done
> 
> fencing/remote.c
> 
> static void remote_op_done(remote_fencing_op_t * op, xmlNode * data, int
> rc, int dup)
> 
>     if (dup == FALSE && safe_str_neq(subt, "broadcast")) {
>         /* Defer notification until the bcast message arrives */
>         bcast_result_to_peers(op, rc);
>         goto remote_op_done_cleanup;
> 
> remote_op_done() previously used to send a direct reply. Now it was changed
> that it does a broadcast of the result & this broadcast is supposed to come
> back to the same fencing agent upon which finishes the operation. However
> this broadcast that is being sent by stonithd doesnt reach back itself.
> This causes eventually crmd to timeout on stonith_async_timeout_handler()
> to be triggered & keep retrying STONITH. so essentially STONITH is broken.

I can vaguelly recall some stonith/heartbeat issue which got
fixed by Lars:

commit cc34288a7b2276aa238546f4aa92fa79b8cbcf88
Author: Lars Ellenberg <lars.ellenberg at linbit.com>
Date:   Tue Dec 9 10:48:33 2014 +0100

Sounds like the same problem.

Thanks,

Dejan


> I tried setting up bcast in /etc/ha.d/ha.cf, but it doesnt help. To confirm
> the issue, I removed the above flow for bcast_result_to_peers() & then
> STONITH works perfectly fine. Any pointers/help appreciated on how to
> resolve this issue in the right way?
> 
> 
> From logs
> 
> CRMD issues STONITH reboot to stonith-ng
> 
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG: Dumping message with
> 19 fields
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[0] :
> [__name__=stonith_command]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[1] :
> [__name__=stonith_command]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[3] :
> [st_async_id=617dce02-057f-42b8-9df2-33bc4e64780d]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[4] : [st_op=st_fence]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[5] : [st_callid=2]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[6] : [st_callopt=0]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[7] :
> [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[8] : [st_target=node1]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[9] :
> [st_device_action=reboot]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[10] : [st_origin=node0]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[11] :
> [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[12] :
> [st_clientname=crmd.14483]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[13] : [st_timeout=90]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[14] : [st_mode=smart]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[15] : [dest=node0]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[16] : [oseq=2]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[17] :
> [from_id=stonith-ng]
> Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[18] :
> [to_id=stonith-ng]
> 
> 
> stonith-ng successfully finishes the STONITH
> Oct 14 14:57:03 node0 stonith-ng[14481]:   notice: log_operation: Operation
> 'reboot' [14657] (call 2 from crmd.14483) for host 'node1' with device
> 'Z-FENCE:0' returned: 0 (OK)
> Oct 14 14:57:03 node0 stonith-ng[14481]:    debug:
> stonith_send_async_reply: Directed reply to node0
> Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> outbound   <st-reply st_origin="node0" t="stonith-ng" st_op="st_fence"
> st_device_id="Z-FENCE:0"
> st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d"
> st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e"
> st_clientname="crmd.14483" st_target="node1" st_device_action="reboot"
> st_callid="2" st_callopt="0" st_rc="0" st_output="Performing: stonith -t
> external/zstonith -T reset node1\nsuccess: node1 0\n"/>
> 
> 
> & stonith-ng sends a self message for st-reply
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message with
> 18 fields
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] :
> [__name__=st-reply]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [st_origin=node0]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : [st_op=st_fence]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] :
> [st_device_id=Z-FENCE:0]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] :
> [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] :
> [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] :
> [st_clientname=crmd.14483]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[8] : [st_target=node1]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[9] :
> [st_device_action=reboot]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[10] : [st_callid=2]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[11] : [st_callopt=0]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[12] : [st_rc=0]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[13] :
> [st_output=Performing: stonith -t external/zstonith -T reset
> node1#012success: node1 0#012]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[14] : [dest=node0]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[15] : [oseq=3]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[16] :
> [from_id=stonith-ng]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[17] :
> [to_id=stonith-ng]
> 
> 
> As part of this stonith-ng does a broadcast
> Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: stonith_command:
> Processing st_fence reply 0 from node0 (               0)
> Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: remote_op_done:
> Broadcasting result
> Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> outbound   <st-reply t="st_notify" subt="broadcast" st_op="st_notify"
> count="1">
> Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> outbound     <st_calldata>
> Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> outbound       <st_notify_fence state="2" st_rc="0" st_target="node1"
> st_device_action="reboot" st_delegate="node0"
> st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0"
> st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e"
> st_clientname="crmd.14483"/>
> Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> outbound     </st_calldata>
> Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> outbound   </st-reply>
> 
> 
> that is realized at hearbeat
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message with
> 8 fields
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] :
> [__name__=st-reply]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [t=st_notify]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [subt=broadcast]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : [st_op=st_notify]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : [count=1]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] :
> [st_calldata=<st_calldata><st_notify_fence state="2" st_rc="0"
> st_target="node1" st_device_action="reboot" st_delegate="node0"
> st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0"
> st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e"
> st_clientname="crmd.14483"/></st_calldata>]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] :
> [from_id=stonith-ng]
> Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] : [to_id=stonith-ng]
> 
> but then on nothing really happens.
> 
> eventually crmd gives up on timeout
> Oct 14 14:59:48 node0 crmd[14483]:    error: stonith_async_timeout_handler:
> Async call 2 timed out after 168000ms
> 
> Thanks.
> 
> --Shyam

> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org





More information about the Users mailing list