[ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate
井上 和徳
inouekazu at intellilink.co.jp
Fri May 26 10:21:06 CEST 2017
Hi Ken,
The cause turned out.
When stonith is executed, stonithd sends results and notifications to crmd.
https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/fencing/remote.c#L402-L406
- when "result" is sent (calling do_local_reply()), too many stonith failures is checked in too_many_st_failures().
https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L638-L669
- when "notification" is sent (calling do_stonith_notify()), the number of failures is incremented in st_fail_count_increment().
https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L704-L726
From this, since checking is done before incrementing, the number of failures in "Too many failures (10) to fence" log does not match the number of actual failures.
I confirmed that the expected result will be obtained from the following changes.
# git diff
diff --git a/fencing/remote.c b/fencing/remote.c
index 4a47d49..3ff324e 100644
--- a/fencing/remote.c
+++ b/fencing/remote.c
@@ -399,12 +399,12 @@ handle_local_reply_and_notify(remote_fencing_op_t * op, xmlNode * data, int rc)
reply = stonith_construct_reply(op->request, NULL, data, rc);
crm_xml_add(reply, F_STONITH_DELEGATE, op->delegate);
- /* Send fencing OP reply to local client that initiated fencing */
- do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
-
/* bcast to all local clients that the fencing operation happend */
do_stonith_notify(0, T_STONITH_NOTIFY_FENCE, rc, notify_data);
+ /* Send fencing OP reply to local client that initiated fencing */
+ do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
+
/* mark this op as having notify's already sent */
op->notify_sent = TRUE;
free_xml(reply);
Regards,
Kazunori INOUE
> -----Original Message-----
> From: Ken Gaillot [mailto:kgaillot at redhat.com]
> Sent: Wednesday, May 17, 2017 11:09 PM
> To: users at clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate
>
> On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
> > On 05/17/2017 11:28 AM, 井上 和徳 wrote:
> >> Hi,
> >> I'm testing Pacemaker-1.1.17-rc1.
> >> The number of failures in "Too many failures (10) to fence" log does not match the number of actual failures.
> >
> > Well it kind of does as after 10 failures it doesn't try fencing again
> > so that is what
> > failures stay at ;-)
> > Of course it still sees the need to fence but doesn't actually try.
> >
> > Regards,
> > Klaus
>
> This feature can be a little confusing: it doesn't prevent all further
> fence attempts of the target, just *immediate* fence attempts. Whenever
> the next transition is started for some other reason (a configuration or
> state change, cluster-recheck-interval, node failure, etc.), it will try
> to fence again.
>
> Also, it only checks this threshold if it's aborting a transition
> *because* of this fence failure. If it's aborting the transition for
> some other reason, the number can go higher than the threshold. That's
> what I'm guessing happened here.
>
> >> After the 11th time fence failure, "Too many failures (10) to fence" is output.
> >> Incidentally, stonith-max-attempts has not been set, so it is 10 by default..
> >>
> >> [root at x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
> >> ##Requesting fencing : 1st time
> >> May 12 05:51:47 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:52:52 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.8415167d: No data available
> >> May 12 05:52:52 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >> ## 2nd time
> >> May 12 05:52:52 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:53:56 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.53d3592a: No data available
> >> May 12 05:53:56 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >> ## 3rd time
> >> May 12 05:53:56 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:55:01 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.9177cb76: No data available
> >> May 12 05:55:01 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >> ## 4th time
> >> May 12 05:55:01 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:56:05 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.946531cb: No data available
> >> May 12 05:56:05 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >> ## 5th time
> >> May 12 05:56:05 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:57:10 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.278b3c4b: No data available
> >> May 12 05:57:10 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >> ## 6th time
> >> May 12 05:57:10 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:58:14 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.7a49aebb:
> No data available
> >> May 12 05:58:14 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >> ## 7th time
> >> May 12 05:58:14 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 05:59:19 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.83421862: No data available
> >> May 12 05:59:19 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >> ## 8th time
> >> May 12 05:59:19 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 06:00:24 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.afd7ef98:
> No data available
> >> May 12 06:00:24 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >> ## 9th time
> >> May 12 06:00:24 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 06:01:28 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.3b033dbe: No data available
> >> May 12 06:01:28 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >> ## 10th time
> >> May 12 06:01:28 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 06:02:33 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for
> crmd.5269 at rhel73-1.5447a345: No data available
> >> May 12 06:02:33 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >> ## 11th time
> >> May 12 06:02:33 rhel73-1 crmd[5269]: notice: Requesting fencing (reboot) of node rhel73-2
> >> May 12 06:03:37 rhel73-1 stonith-ng[5265]: error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.db50c21a:
> No data available
> >> May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up
> >> May 12 06:03:37 rhel73-1 crmd[5269]: notice: Transition aborted: Stonith failed
> >>
> >> Regards,
> >> Kazunori INOUE
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
More information about the Users
mailing list