[ClusterLabs] heartbeat/pacemaker: stonith-ng does a broadcast that doesnt reach it back resulting stonith_async_timeout_handler() called

Shyam shyam.kaushik at gmail.com
Thu Oct 15 17:21:53 UTC 2015


Hi Dejan,

Thanks a lot for your input!

I cherry picked this commit & this solves the problem. I will raise a
ubuntu launchpad bug for them to pull this correction in trusty stable.
Thanks!

--Shyam

On Thu, Oct 15, 2015 at 1:21 PM, Dejan Muhamedagic <dejanmm at fastmail.fm>
wrote:

> Hi,
>
> On Wed, Oct 14, 2015 at 10:31:40PM +0530, Shyam wrote:
> > Hi all,
> >
> > We were previously using Pacemaker 1.1.6 on Ubuntu Precise. We are
> shifting
> > to Ubuntu Trusty & it has Pacemaker 1.1.10.
> >
> > We have our own STONITH resource & with the pacemaker upgrade this is
> > broken. The problem that I am hitting is not something new & is exactly
> > that was reported in
> > http://clusterlabs.org/pipermail/users/2015-June/000686.html
> >
> > Basically Hearbeat + Pacemaker stack suffers this problem (even with the
> > github mainline code) & since a lot of people are using corosync I think
> > the problem is not felt with.
> >
> > The basic problem is this. Between Pacemaker 1.1.6 to 1.1.10, the
> following
> > change was done
> >
> > fencing/remote.c
> >
> > static void remote_op_done(remote_fencing_op_t * op, xmlNode * data, int
> > rc, int dup)
> >
> >     if (dup == FALSE && safe_str_neq(subt, "broadcast")) {
> >         /* Defer notification until the bcast message arrives */
> >         bcast_result_to_peers(op, rc);
> >         goto remote_op_done_cleanup;
> >
> > remote_op_done() previously used to send a direct reply. Now it was
> changed
> > that it does a broadcast of the result & this broadcast is supposed to
> come
> > back to the same fencing agent upon which finishes the operation. However
> > this broadcast that is being sent by stonithd doesnt reach back itself.
> > This causes eventually crmd to timeout on stonith_async_timeout_handler()
> > to be triggered & keep retrying STONITH. so essentially STONITH is
> broken.
>
> I can vaguelly recall some stonith/heartbeat issue which got
> fixed by Lars:
>
> commit cc34288a7b2276aa238546f4aa92fa79b8cbcf88
> Author: Lars Ellenberg <lars.ellenberg at linbit.com>
> Date:   Tue Dec 9 10:48:33 2014 +0100
>
> Sounds like the same problem.
>
> Thanks,
>
> Dejan
>
>
> > I tried setting up bcast in /etc/ha.d/ha.cf, but it doesnt help. To
> confirm
> > the issue, I removed the above flow for bcast_result_to_peers() & then
> > STONITH works perfectly fine. Any pointers/help appreciated on how to
> > resolve this issue in the right way?
> >
> >
> > From logs
> >
> > CRMD issues STONITH reboot to stonith-ng
> >
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG: Dumping message
> with
> > 19 fields
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[0] :
> > [__name__=stonith_command]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[1] :
> > [__name__=stonith_command]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[3] :
> > [st_async_id=617dce02-057f-42b8-9df2-33bc4e64780d]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[4] :
> [st_op=st_fence]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[5] : [st_callid=2]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[6] : [st_callopt=0]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[7] :
> > [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[8] :
> [st_target=node1]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[9] :
> > [st_device_action=reboot]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[10] :
> [st_origin=node0]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[11] :
> > [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[12] :
> > [st_clientname=crmd.14483]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[13] :
> [st_timeout=90]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[14] :
> [st_mode=smart]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[15] : [dest=node0]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[16] : [oseq=2]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[17] :
> > [from_id=stonith-ng]
> > Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[18] :
> > [to_id=stonith-ng]
> >
> >
> > stonith-ng successfully finishes the STONITH
> > Oct 14 14:57:03 node0 stonith-ng[14481]:   notice: log_operation:
> Operation
> > 'reboot' [14657] (call 2 from crmd.14483) for host 'node1' with device
> > 'Z-FENCE:0' returned: 0 (OK)
> > Oct 14 14:57:03 node0 stonith-ng[14481]:    debug:
> > stonith_send_async_reply: Directed reply to node0
> > Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> > outbound   <st-reply st_origin="node0" t="stonith-ng" st_op="st_fence"
> > st_device_id="Z-FENCE:0"
> > st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d"
> > st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e"
> > st_clientname="crmd.14483" st_target="node1" st_device_action="reboot"
> > st_callid="2" st_callopt="0" st_rc="0" st_output="Performing: stonith -t
> > external/zstonith -T reset node1\nsuccess: node1 0\n"/>
> >
> >
> > & stonith-ng sends a self message for st-reply
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message
> with
> > 18 fields
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] :
> > [__name__=st-reply]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] :
> [st_origin=node0]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] :
> [st_op=st_fence]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] :
> > [st_device_id=Z-FENCE:0]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] :
> > [st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] :
> > [st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] :
> > [st_clientname=crmd.14483]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[8] :
> [st_target=node1]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[9] :
> > [st_device_action=reboot]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[10] : [st_callid=2]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[11] : [st_callopt=0]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[12] : [st_rc=0]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[13] :
> > [st_output=Performing: stonith -t external/zstonith -T reset
> > node1#012success: node1 0#012]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[14] : [dest=node0]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[15] : [oseq=3]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[16] :
> > [from_id=stonith-ng]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[17] :
> > [to_id=stonith-ng]
> >
> >
> > As part of this stonith-ng does a broadcast
> > Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: stonith_command:
> > Processing st_fence reply 0 from node0 (               0)
> > Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: remote_op_done:
> > Broadcasting result
> > Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> > outbound   <st-reply t="st_notify" subt="broadcast" st_op="st_notify"
> > count="1">
> > Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> > outbound     <st_calldata>
> > Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> > outbound       <st_notify_fence state="2" st_rc="0" st_target="node1"
> > st_device_action="reboot" st_delegate="node0"
> > st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0"
> > st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e"
> > st_clientname="crmd.14483"/>
> > Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> > outbound     </st_calldata>
> > Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
> > outbound   </st-reply>
> >
> >
> > that is realized at hearbeat
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message
> with
> > 8 fields
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] :
> > [__name__=st-reply]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [t=st_notify]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] :
> [subt=broadcast]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] :
> [st_op=st_notify]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : [count=1]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] :
> > [st_calldata=<st_calldata><st_notify_fence state="2" st_rc="0"
> > st_target="node1" st_device_action="reboot" st_delegate="node0"
> > st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0"
> > st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e"
> > st_clientname="crmd.14483"/></st_calldata>]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] :
> > [from_id=stonith-ng]
> > Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] :
> [to_id=stonith-ng]
> >
> > but then on nothing really happens.
> >
> > eventually crmd gives up on timeout
> > Oct 14 14:59:48 node0 crmd[14483]:    error:
> stonith_async_timeout_handler:
> > Async call 2 timed out after 168000ms
> >
> > Thanks.
> >
> > --Shyam
>
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20151015/f1bc2139/attachment-0002.html>


More information about the Users mailing list