[ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

Fri May 26 10:21:06 CEST 2017

Hi Ken,

The cause turned out.

When stonith is executed, stonithd sends results and notifications to crmd.
https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/fencing/remote.c#L402-L406

- when "result" is sent (calling do_local_reply()), too many stonith failures is checked in too_many_st_failures().
  https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L638-L669
- when "notification" is sent (calling do_stonith_notify()), the number of failures is incremented in st_fail_count_increment().
  https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L704-L726
From this, since checking is done before incrementing, the number of failures in "Too many failures (10) to fence" log does not match the number of actual failures.

I confirmed that the expected result will be obtained from the following changes.

# git diff

diff --git a/fencing/remote.c b/fencing/remote.c
index 4a47d49..3ff324e 100644
--- a/fencing/remote.c
+++ b/fencing/remote.c
@@ -399,12 +399,12 @@ handle_local_reply_and_notify(remote_fencing_op_t * op, xmlNode * data, int rc)
     reply = stonith_construct_reply(op->request, NULL, data, rc);
     crm_xml_add(reply, F_STONITH_DELEGATE, op->delegate);

-    /* Send fencing OP reply to local client that initiated fencing */
-    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
-
     /* bcast to all local clients that the fencing operation happend */
     do_stonith_notify(0, T_STONITH_NOTIFY_FENCE, rc, notify_data);

+    /* Send fencing OP reply to local client that initiated fencing */
+    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
+
     /* mark this op as having notify's already sent */
     op->notify_sent = TRUE;
     free_xml(reply);

Regards,
Kazunori INOUE

> -----Original Message-----
> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> Sent: Wednesday, May 17, 2017 11:09 PM
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate
> 
> On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
> > On 05/17/2017 11:28 AM, 井上 和徳 wrote:
> >> Hi,
> >> I'm testing Pacemaker-1.1.17-rc1.
> >> The number of failures in "Too many failures (10) to fence" log does not match the number of actual failures.
> >
> > Well it kind of does as after 10 failures it doesn't try fencing again
> > so that is what
> > failures stay at ;-)
> > Of course it still sees the need to fence but doesn't actually try.
> >
> > Regards,
> > Klaus
> 
> This feature can be a little confusing: it doesn't prevent all further
> fence attempts of the target, just *immediate* fence attempts. Whenever
> the next transition is started for some other reason (a configuration or
> state change, cluster-recheck-interval, node failure, etc.), it will try
> to fence again.
> 
> Also, it only checks this threshold if it's aborting a transition
> *because* of this fence failure. If it's aborting the transition for
> some other reason, the number can go higher than the threshold. That's
> what I'm guessing happened here.
> 
> >> After the 11th time fence failure, "Too many failures (10) to fence" is output.
> >> Incidentally, stonith-max-attempts has not been set, so it is 10 by default..
> >>
> >> [root at x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
> >> ##Requesting fencing : 1st time
> >> May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.8415167d: No data available
> >> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >> ## 2nd time
> >> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.53d3592a: No data available
> >> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >> ## 3rd time
> >> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.9177cb76: No data available
> >> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >> ## 4th time
> >> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:56:05 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.946531cb: No data available
> >> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >> ## 5th time
> >> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:57:10 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.278b3c4b: No data available
> >> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >> ## 6th time
> >> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:58:14 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.7a49aebb:
> No data available
> >> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >> ## 7th time
> >> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:59:19 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.83421862: No data available
> >> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >> ## 8th time
> >> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 06:00:24 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.afd7ef98:
> No data available
> >> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >> ## 9th time
> >> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 06:01:28 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.3b033dbe: No data available
> >> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >> ## 10th time
> >> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 06:02:33 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.5447a345: No data available
> >> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >> ## 11th time
> >> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.db50c21a:
> No data available
> >> May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up
> >> May 12 06:03:37 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
> >>
> >> Regards,
> >> Kazunori INOUE
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org