[ClusterLabs] STONITH not communicated back to initiator until token expires

Wed Apr 26 18:57:47 CEST 2017

Just to close the loop on this issue, discussions with Redhat have
confirmed that this behavior is as designed, that all membership
changes must first be realized by the Corosync layer.  So the full
trajectory of a STONITH action in response to, for example, a failed
stop operation looks like:

crmd requests STONITH
stonith-ng successfully STONITHs node
<wait for token to time out>
corosync communicates membership change to stonith-ng
stonith-ng communicates successful STONITH to crmd
cluster reacts to down node

Thanks,
Chris

On Wed, Apr 5, 2017 at 5:07 PM, Chris Walker
<christopher.walker at gmail.com> wrote:
> Thanks very much for your reply Ken.  Unfortunately, the same delay
> happens when the DC is not the node that's being STONITHed. In either
> case, the failure looks the same to me: the stonithd instance that
> does the STONITH operation does not pass back the result to the
> original stonithd, so remote_op_done can't be invoked to send the
> result to the original initiator (in this case, crmd).
>
> Sorry, this problem is probably too in-depth for the mailing list.
> I've created RH ticket 01812422 for this issue (seems stuck in L1/L2
> support at the moment :( )
>
> Thanks again,
> Chris
>
>
>
> On Tue, Apr 4, 2017 at 12:47 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
>> On 03/13/2017 10:43 PM, Chris Walker wrote:
>>> Thanks for your reply Digimer.
>>>
>>> On Mon, Mar 13, 2017 at 1:35 PM, Digimer <lists at alteeve.ca
>>> <mailto:lists at alteeve.ca>> wrote:
>>>
>>>     On 13/03/17 12:07 PM, Chris Walker wrote:
>>>     > Hello,
>>>     >
>>>     > On our two-node EL7 cluster (pacemaker: 1.1.15-11.el7_3.4; corosync:
>>>     > 2.4.0-4; libqb: 1.0-1),
>>>     > it looks like successful STONITH operations are not communicated from
>>>     > stonith-ng back to theinitiator (in this case, crmd) until the STONITHed
>>>     > node is removed from the cluster when
>>>     > Corosync notices that it's gone (i.e., after the token timeout).
>>>
>>>     Others might have more useful info, but my understanding of a lost node
>>>     sequence is this;
>>>
>>>     1. Node stops responding, corosync declares it lost after token timeout
>>>     2. Corosync reforms the cluster with remaining node(s), checks if it is
>>>     quorate (always true in 2-node)
>>>     3. Corosync informs Pacemaker of the membership change.
>>>     4. Pacemaker invokes stonith, waits for the fence agent to return
>>>     "success" (exit code of the agent as per the FenceAgentAPI
>>>     [https://docs.pagure.org/ClusterLabs.fence-agents/FenceAgentAPI.md]
>>>     <https://docs.pagure.org/ClusterLabs.fence-agents/FenceAgentAPI.md%5D>).
>>>     If
>>>     the method fails, it moves on to the next method. If all methods fail,
>>>     it goes back to the first method and tries again, looping indefinitely.
>>>
>>>
>>> That's roughly my understanding as well for the case when a node
>>> suddenly leaves the cluster (e.g., poweroff), and this case is working
>>> as expected for me.  I'm seeing delays when a node is marked for STONITH
>>> while it's still up (e.g., after a stop operation fails).  In this case,
>>> what I expect to see is something like:
>>> 1.  crmd requests that stonith-ng fence the node
>>> 2.  stonith-ng (might be a different stonith-ng) fences the node and
>>> sends a message that it has succeeded
>>> 3.  stonith-ng (the original from step 1) receives this message and
>>> communicates back to crmd that the node has been fenced
>>>
>>> but what I'm seeing is
>>> 1.  crmd requests that stonith-ng fence the node
>>> 2.  stonith-ng fences the node and sends a message saying that it has
>>> succeeded
>>> 3.  nobody hears this message
>>> 4.  Corosync eventually realizes that the fenced node is no longer part
>>> of the config and broadcasts a config change
>>> 5.  stonith-ng finishes the STONITH operation that was started earlier
>>> and communicates back to crmd that the node has been STONITHed
>>
>> In your attached log, bug1 was DC at the time of the fencing, and bug0
>> takes over DC after the fencing. This is what I expect is happening
>> (logs from bug1 would help confirm):
>>
>> 1. crmd on the DC (bug1) runs pengine which sees the stop failure and
>> schedules fencing (of bug1)
>>
>> 2. stonithd on bug1 sends a query to all nodes asking who can fence bug1
>>
>> 3. Each node replies, and stonithd on bug1 chooses bug0 to execute the
>> fencing
>>
>> 4. stonithd on bug0 fences bug1. At this point, it would normally report
>> the result to the DC ... but that happens to be bug1.
>>
>> 5. Once crmd on bug0 takes over DC, it can decide that the fencing
>> succeeded, but it can't take over DC until it sees that the old DC is
>> gone, which takes a while because of your long token timeout. So, this
>> is where the delay is coming in.
>>
>> I'll have to think about whether we can improve this, but I don't think
>> it would be easy. There are complications if for example a fencing
>> topology is used, such that the result being reported in step 4 might
>> not be the entire result.
>>
>>> I'm less convinced that the sending of the STONITH notify in step 2 is
>>> at fault; it also seems possible that a callback is not being run when
>>> it should be.
>>>
>>>
>>> The Pacemaker configuration is not important (I've seen this happen on
>>> our production clusters and on a small sandbox), but the config is:
>>>
>>> primitive bug0-stonith stonith:fence_ipmilan \
>>>         params pcmk_host_list=bug0 ipaddr=bug0-ipmi action=off
>>> login=admin passwd=admin \
>>>         meta target-role=Started
>>> primitive bug1-stonith stonith:fence_ipmilan \
>>>         params pcmk_host_list=bug1 ipaddr=bug1-ipmi action=off
>>> login=admin passwd=admin \
>>>         meta target-role=Started
>>> primitive prm-snmp-heartbeat snmptrap_heartbeat \
>>>         params snmphost=bug0 community=public \
>>>         op monitor interval=10 timeout=300 \
>>>         op start timeout=300 interval=0 \
>>>         op stop timeout=300 interval=0
>>> clone cln-snmp-heartbeat prm-snmp-heartbeat \
>>>         meta interleave=true globally-unique=false ordered=false
>>> notify=false
>>> location bug0-stonith-loc bug0-stonith -inf: bug0
>>> location bug1-stonith-loc bug1-stonith -inf: bug1
>>>
>>> The corosync config might be more interesting:
>>>
>>> totem {
>>>     version: 2
>>>     crypto_cipher: none
>>>     crypto_hash: none
>>>     secauth: off
>>>     rrp_mode: passive
>>>     transport: udpu
>>>     token: 240000
>>>     consensus: 1000
>>>
>>>     interface {
>>>         ringnumber 0
>>>         bindnetaddr: 203.0.113.0
>>>         mcastport: 5405
>>>         ttl: 1
>>>     }
>>> }
>>> nodelist {
>>>         node {
>>>                 ring0_addr: 203.0.113.1
>>>                 nodeid: 1
>>>         }
>>>         node {
>>>                 ring0_addr: 203.0.113.2
>>>                 nodeid: 2
>>>         }
>>> }
>>> quorum {
>>>     provider: corosync_votequorum
>>>     two_node: 1
>>> }
>>>
>>>
>>>     > In trace debug logs, I see the STONITH reply sent via the
>>>     > cpg_mcast_joined (libqb) function in crm_cs_flush
>>>     >
>>>     (stonith_send_async_reply->send_cluster_text->send_cluster_text->send_cpg_iov->crm_cs_flush->cpg_mcast_joined)
>>>     >
>>>     > Mar 13 07:18:22 [6466] bug0 stonith-ng: (  commands.c:1891  )   trace:
>>>     > stonith_send_async_reply:        Reply   <st-reply st_origin="bug1"
>>>     > t="stonith-ng" st_op="st_fence" st_device_id="ustonith:0"
>>>     > st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0"
>>>     > st_clientid="823e92da-8470-491a-b662-215526cced22"
>>>     > st_clientname="crmd.3973" st_target="bug1" st_device_action="st_fence"
>>>     > st_callid="3" st_callopt="0" st_rc="0" st_output="Chassis Power
>>>     Control:
>>>     > Reset\nChassis Power Control: Down/Off\nChassis Power Control:
>>>     Down/Off\nC
>>>     > Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:636   )   trace:
>>>     > send_cluster_text:       Queueing CPG message 9 to all (1041
>>>     bytes, 449
>>>     > bytes payload): <st-reply st_origin="bug1" t="stonith-ng"
>>>     > st_op="st_notify" st_device_id="ustonith:0"
>>>     > st_remote_op="39b1f1e0-b76f-4d25-bd15-77b956c914a0"
>>>     > st_clientid="823e92da-8470-491a-b662-215526cced22" st_clientna
>>>     > Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:207   )   trace:
>>>     > send_cpg_iov:    Queueing CPG message 9 (1041 bytes)
>>>     > Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:170   )   trace:
>>>     > crm_cs_flush:    CPG message sent, size=1041
>>>     > Mar 13 07:18:22 [6466] bug0 stonith-ng: (       cpg.c:185   )   trace:
>>>     > crm_cs_flush:    Sent 1 CPG messages  (0 remaining, last=9): OK (1)
>>>     >
>>>     > But I see no further action from stonith-ng until minutes later;
>>>     > specifically, I don't see remote_op_done run, so the dead node is
>>>     still
>>>     > an 'online (unclean)' member of the array and no failover can take
>>>     place.
>>>     >
>>>     > When the token expires (yes, we use a very long token), I see the
>>>     following:
>>>     >
>>>     > Mar 13 07:22:11 [6466] bug0 stonith-ng: (membership.c:1018  )  notice:
>>>     > crm_update_peer_state_iter:      Node bug1 state is now lost |
>>>     nodeid=2
>>>     > previous=member source=crm_update_peer_proc
>>>     > Mar 13 07:22:11 [6466] bug0 stonith-ng: (      main.c:1275  )   debug:
>>>     > st_peer_update_callback: Broadcasting our uname because of node 2
>>>     > Mar 13 07:22:11 [6466] bug0 stonith-ng: (       cpg.c:636   )   trace:
>>>     > send_cluster_text:       Queueing CPG message 10 to all (666 bytes, 74
>>>     > bytes payload): <stonith_command __name__="stonith_command"
>>>     > t="stonith-ng" st_op="poke"/>
>>>     > ...
>>>     > Mar 13 07:22:11 [6466] bug0 stonith-ng: (  commands.c:2582  )   debug:
>>>     > stonith_command: Processing st_notify reply 0 from bug0 (
>>>          0)
>>>     > Mar 13 07:22:11 [6466] bug0 stonith-ng: (    remote.c:1945  )   debug:
>>>     > process_remote_stonith_exec:     Marking call to poweroff for bug1 on
>>>     > behalf of crmd.3973 at 39b1f1e0-b76f-4d25-bd15-77b956c914a0.bug1: OK (0)
>>>     >
>>>     > and the STONITH command is finally communicated back to crmd as
>>>     complete
>>>     > and failover commences.
>>>     >
>>>     > Is this delay a feature of the cpg_mcast_joined function?  If I
>>>     > understand correctly (unlikely), it looks like cpg_mcast_joined is not
>>>     > completing because one of the nodes in the group is missing, but I
>>>     > haven't looked at that code closely yet.  Is it advisable to have
>>>     > stonith-ng broadcast a membership change when it successfully
>>>     fences a node?
>>>     >
>>>     > Attaching logs with PCMK_debug=stonith-ng
>>>     > and
>>>     PCMK_trace_functions=stonith_send_async_reply,send_cluster_text,send_cpg_iov,crm_cs_flush
>>>     >
>>>     > Thanks in advance,
>>>     > Chris
>>>
>>>     Can you share your full pacemaker config (please obfuscate passwords).
>>>
>>>     --
>>>     Digimer
>>>     Papers and Projects: https://alteeve.com/w/
>>>     "I am, somehow, less interested in the weight and convolutions of
>>>     Einstein’s brain than in the near certainty that people of equal talent
>>>     have lived and died in cotton fields and sweatshops." - Stephen Jay
>>>     Gould
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org