[ClusterLabs] heartbeat/pacemaker: stonith-ng does a broadcast that doesnt reach it back resulting stonith_async_timeout_handler() called

Shyam shyam.kaushik at gmail.com
Wed Oct 14 17:01:40 UTC 2015


Hi all,

We were previously using Pacemaker 1.1.6 on Ubuntu Precise. We are shifting
to Ubuntu Trusty & it has Pacemaker 1.1.10.

We have our own STONITH resource & with the pacemaker upgrade this is
broken. The problem that I am hitting is not something new & is exactly
that was reported in
http://clusterlabs.org/pipermail/users/2015-June/000686.html

Basically Hearbeat + Pacemaker stack suffers this problem (even with the
github mainline code) & since a lot of people are using corosync I think
the problem is not felt with.

The basic problem is this. Between Pacemaker 1.1.6 to 1.1.10, the following
change was done

fencing/remote.c

static void remote_op_done(remote_fencing_op_t * op, xmlNode * data, int
rc, int dup)

    if (dup == FALSE && safe_str_neq(subt, "broadcast")) {
        /* Defer notification until the bcast message arrives */
        bcast_result_to_peers(op, rc);
        goto remote_op_done_cleanup;

remote_op_done() previously used to send a direct reply. Now it was changed
that it does a broadcast of the result & this broadcast is supposed to come
back to the same fencing agent upon which finishes the operation. However
this broadcast that is being sent by stonithd doesnt reach back itself.
This causes eventually crmd to timeout on stonith_async_timeout_handler()
to be triggered & keep retrying STONITH. so essentially STONITH is broken.

I tried setting up bcast in /etc/ha.d/ha.cf, but it doesnt help. To confirm
the issue, I removed the above flow for bcast_result_to_peers() & then
STONITH works perfectly fine. Any pointers/help appreciated on how to
resolve this issue in the right way?


>From logs

CRMD issues STONITH reboot to stonith-ng

Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG: Dumping message with
19 fields
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[0] :
[__name__=stonith_command]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[1] :
[__name__=stonith_command]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[3] :
[st_async_id=617dce02-057f-42b8-9df2-33bc4e64780d]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[4] : [st_op=st_fence]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[5] : [st_callid=2]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[6] : [st_callopt=0]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[7] :
[st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[8] : [st_target=node1]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[9] :
[st_device_action=reboot]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[10] : [st_origin=node0]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[11] :
[st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[12] :
[st_clientname=crmd.14483]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[13] : [st_timeout=90]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[14] : [st_mode=smart]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[15] : [dest=node0]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[16] : [oseq=2]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[17] :
[from_id=stonith-ng]
Oct 14 14:57:01 node0 heartbeat: [14463]: debug: MSG[18] :
[to_id=stonith-ng]


stonith-ng successfully finishes the STONITH
Oct 14 14:57:03 node0 stonith-ng[14481]:   notice: log_operation: Operation
'reboot' [14657] (call 2 from crmd.14483) for host 'node1' with device
'Z-FENCE:0' returned: 0 (OK)
Oct 14 14:57:03 node0 stonith-ng[14481]:    debug:
stonith_send_async_reply: Directed reply to node0
Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
outbound   <st-reply st_origin="node0" t="stonith-ng" st_op="st_fence"
st_device_id="Z-FENCE:0"
st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d"
st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e"
st_clientname="crmd.14483" st_target="node1" st_device_action="reboot"
st_callid="2" st_callopt="0" st_rc="0" st_output="Performing: stonith -t
external/zstonith -T reset node1\nsuccess: node1 0\n"/>


& stonith-ng sends a self message for st-reply
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message with
18 fields
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] :
[__name__=st-reply]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [st_origin=node0]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [t=stonith-ng]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : [st_op=st_fence]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] :
[st_device_id=Z-FENCE:0]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] :
[st_remote_op=617dce02-057f-42b8-9df2-33bc4e64780d]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] :
[st_clientid=346fea6b-a55d-4873-b630-14287d5bc71e]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] :
[st_clientname=crmd.14483]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[8] : [st_target=node1]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[9] :
[st_device_action=reboot]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[10] : [st_callid=2]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[11] : [st_callopt=0]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[12] : [st_rc=0]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[13] :
[st_output=Performing: stonith -t external/zstonith -T reset
node1#012success: node1 0#012]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[14] : [dest=node0]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[15] : [oseq=3]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[16] :
[from_id=stonith-ng]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[17] :
[to_id=stonith-ng]


As part of this stonith-ng does a broadcast
Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: stonith_command:
Processing st_fence reply 0 from node0 (               0)
Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: remote_op_done:
Broadcasting result
Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
outbound   <st-reply t="st_notify" subt="broadcast" st_op="st_notify"
count="1">
Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
outbound     <st_calldata>
Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
outbound       <st_notify_fence state="2" st_rc="0" st_target="node1"
st_device_action="reboot" st_delegate="node0"
st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0"
st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e"
st_clientname="crmd.14483"/>
Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
outbound     </st_calldata>
Oct 14 14:57:03 node0 stonith-ng[14481]:    debug: send_ha_message:
outbound   </st-reply>


that is realized at hearbeat
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG: Dumping message with
8 fields
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[0] :
[__name__=st-reply]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[1] : [t=st_notify]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[2] : [subt=broadcast]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[3] : [st_op=st_notify]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[4] : [count=1]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[5] :
[st_calldata=<st_calldata><st_notify_fence state="2" st_rc="0"
st_target="node1" st_device_action="reboot" st_delegate="node0"
st_remote_op="617dce02-057f-42b8-9df2-33bc4e64780d" st_origin="node0"
st_clientid="346fea6b-a55d-4873-b630-14287d5bc71e"
st_clientname="crmd.14483"/></st_calldata>]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[6] :
[from_id=stonith-ng]
Oct 14 14:57:03 node0 heartbeat: [14463]: debug: MSG[7] : [to_id=stonith-ng]

but then on nothing really happens.

eventually crmd gives up on timeout
Oct 14 14:59:48 node0 crmd[14483]:    error: stonith_async_timeout_handler:
Async call 2 timed out after 168000ms

Thanks.

--Shyam
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20151014/598b08fb/attachment-0003.html>


More information about the Users mailing list