[ClusterLabs] STONITH not communicated back to initiator until token expires

Chris Walker christopher.walker at gmail.com
Mon Mar 13 17:07:18 CET 2017


Hello,

On our two-node EL7 cluster (pacemaker: 1.1.15-11.el7_3.4; corosync:
2.4.0-4; libqb: 1.0-1), it looks like successful STONITH operations are
not communicated from stonith-ng back to the initiator (in this case,
crmd) until the STONITHed node is removed from the cluster, which only
happens when Corosync notices that it's gone (i.e., after the token
timeout).

In trace debug logs, I see the STONITH reply sent via the cpg_mcast_joined
function (corosync's libcpg, which uses libqb for IPC) in crm_cs_flush
(stonith_send_async_reply->send_cluster_text->send_cpg_iov->crm_cs_flush->cpg_mcast_joined)

Mar 13 07:18:22 [6466] bug0 stonith-ng: (  commands.c:1891  )   trace:
stonith_send_async_reply:        Reply   <st-reply st_origin="bug1"
t="stonith-ng" st_op="st_fence" st_device_id="ustonith:0"
st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0"
st_clientid="823e92da-8470-491a-b662-215526cced22"
st_clientname="crmd.3973" st_target="bug1" st_device_action="st_fence"
st_callid="3" st_callopt="0" st_rc="0" st_output="Chassis Power Control:
Reset\nChassis Power Control: Down/Off\nChassis Power Control: Down/Off\nC
Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:636   )   trace:
send_cluster_text:       Queueing CPG message 9 to all (1041 bytes, 449
bytes payload): <st-reply st_origin="bug1" t="stonith-ng" st_op="st_notify"
st_device_id="ustonith:0"
st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0"
st_clientid="823e92da-8470-491a-b662-215526cced22" st_clientna
Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:207   )   trace:
send_cpg_iov:    Queueing CPG message 9 (1041 bytes)
Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:170   )   trace:
crm_cs_flush:    CPG message sent, size=1041
Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:185   )   trace:
crm_cs_flush:    Sent 1 CPG messages  (0 remaining, last=9): OK (1)
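
For anyone checking their own logs for the same pattern, here is a minimal
Python sketch that pulls the counters out of a crm_cs_flush "Sent ... CPG
messages" line like the one above. It is not part of Pacemaker; the names
FLUSH_RE and parse_flush are made up for illustration, and the regex assumes
the message wording of the 1.1.15 log line shown here (with the email line
wrap removed).

```python
import re

# Matches e.g. "Sent 1 CPG messages  (0 remaining, last=9): OK (1)".
# The wording is taken from the log excerpt above; other Pacemaker
# versions may format this line differently (assumption).
FLUSH_RE = re.compile(r"Sent (\d+) CPG messages\s+\((\d+) remaining, last=(\d+)\)")


def parse_flush(line):
    """Return the sent/remaining/last counters from a crm_cs_flush line,
    or None if the line does not match."""
    match = FLUSH_RE.search(line)
    if match is None:
        return None
    sent, remaining, last = (int(group) for group in match.groups())
    return {"sent": sent, "remaining": remaining, "last": last}


line = "crm_cs_flush:    Sent 1 CPG messages  (0 remaining, last=9): OK (1)"
info = parse_flush(line)
print(info)
```

A "remaining" value of 0 with "last" equal to the queued message number (9
here) indicates that the local send queue was drained, so the message was
handed to Corosync rather than stuck in stonith-ng's own queue.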

But I see no further action from stonith-ng until minutes later;
specifically, I don't see remote_op_done run, so the dead node is still an
'online (unclean)' member of the array and no failover can take place.

When the token expires (yes, we use a very long token), I see the following:

Mar 13 07:22:11 [6466] bug0 stonith-ng: (membership.c:1018  )  notice:
crm_update_peer_state_iter:      Node bug1 state is now lost | nodeid=2
previous=member source=crm_update_peer_proc
Mar 13 07:22:11 [6466] bug0 stonith-ng: (      main.c:1275  )   debug:
st_peer_update_callback: Broadcasting our uname because of node 2
Mar 13 07:22:11 [6466] bug0 stonith-ng: (       cpg.c:636   )   trace:
send_cluster_text:       Queueing CPG message 10 to all (666 bytes, 74
bytes payload): <stonith_command __name__="stonith_command" t="stonith-ng"
st_op="poke"/>
...
Mar 13 07:22:11 [6466] bug0 stonith-ng: (  commands.c:2582  )   debug:
stonith_command: Processing st_notify reply 0 from bug0 (               0)
Mar 13 07:22:11 [6466] bug0 stonith-ng: (    remote.c:1945  )   debug:
process_remote_stonith_exec:     Marking call to poweroff for bug1 on
behalf of crmd.3973 at 39b1f1e0-b76f-4d25-bd15-77b956c914a0.bug1: OK (0)

and the STONITH result is finally reported back to crmd as complete, and
failover commences.
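
To put a number on the delay, here is a minimal Python sketch that computes
the gap between two log lines from their leading timestamps. The helper names
are hypothetical, the log lines are shortened copies of the ones quoted above,
and two assumptions are baked in: the log stamp carries no year (2017 is
supplied from the date of this post), and month names are parsed in an
English/C locale.

```python
from datetime import datetime

# Assumption: the logs omit the year, so one is supplied explicitly.
ASSUMED_YEAR = 2017
LOG_TS_FORMAT = "%Y %b %d %H:%M:%S"


def parse_log_timestamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a pacemaker log line."""
    # The first 15 characters hold the stamp, e.g. "Mar 13 07:18:22".
    return datetime.strptime(f"{ASSUMED_YEAR} {line[:15]}", LOG_TS_FORMAT)


def seconds_between(first_line, second_line):
    """Whole seconds elapsed from first_line to second_line."""
    delta = parse_log_timestamp(second_line) - parse_log_timestamp(first_line)
    return int(delta.total_seconds())


# Shortened copies of the two log entries quoted in this message.
sent = "Mar 13 07:18:22 [6466] bug0 stonith-ng: crm_cs_flush: Sent 1 CPG messages"
processed = "Mar 13 07:22:11 [6466] bug0 stonith-ng: stonith_command: Processing st_notify reply 0 from bug0"

print(seconds_between(sent, processed))
```

For the two entries above this gives 229 seconds between the reply being
flushed to CPG and stonith-ng processing the st_notify reply.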

Is this delay a feature of the cpg_mcast_joined function?  If I understand
correctly (unlikely), it looks like cpg_mcast_joined is not completing
because one of the nodes in the group is missing, but I haven't looked at
that code closely yet.  Is it advisable to have stonith-ng broadcast a
membership change when it successfully fences a node?

Attached are logs captured with PCMK_debug=stonith-ng
and PCMK_trace_functions=stonith_send_async_reply,send_cluster_text,send_cpg_iov,crm_cs_flush

Thanks in advance,
Chris
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pacemaker.log.bz2
Type: application/x-bzip2
Size: 20597 bytes
Desc: not available
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20170313/a21733ce/attachment-0001.bz2>

