[ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

Ken Gaillot kgaillot at redhat.com
Wed May 31 19:39:55 EDT 2017


On 05/26/2017 03:21 AM, 井上 和徳 wrote:
> Hi Ken,
> 
> I found the cause.
> 
> When stonith is executed, stonithd sends results and notifications to crmd.
> https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/fencing/remote.c#L402-L406
> 
> - When the "result" is sent (via do_local_reply()), the failure threshold is checked in too_many_st_failures().
>   https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L638-L669
> - When the "notification" is sent (via do_stonith_notify()), the failure count is incremented in st_fail_count_increment().
>   https://github.com/ClusterLabs/pacemaker/blob/0459f409580f41b35ce8ae31fb22e6370a508dab/crmd/te_callbacks.c#L704-L726
> Because the check happens before the increment, the count in the "Too many failures (10) to fence" log does not match the actual number of failures.

Thanks for this analysis!
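
To make the ordering concrete, here is a minimal sketch (illustrative C, not
actual Pacemaker code; the counter and function here are stand-ins for the
crmd's real bookkeeping) of why check-before-increment makes the logged
count lag by one:

```c
#include <assert.h>

#define STONITH_MAX_ATTEMPTS 10   /* default stonith-max-attempts */

/* Hypothetical per-target failure counter, standing in for the
 * crmd's real bookkeeping. */
static int fail_count = 0;

/* Check-before-increment, in the order stonithd delivers messages:
 * the threshold check runs when the *result* arrives, the increment
 * only later, when the *notification* arrives. */
static int failed_fence_attempt(void)
{
    int too_many = (fail_count >= STONITH_MAX_ATTEMPTS); /* result handler */
    fail_count++;                                        /* notify handler */
    return too_many;
}
```

With this ordering, the check made on the Nth failure sees only N-1 recorded
failures, so "Too many failures (10)" is first logged on the 11th attempt,
which matches the log excerpt quoted below.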

We do want the result to be sent before the notifications, so the
solution will be slightly more complicated. The DC will have to call
st_fail_count_increment() when receiving the result, while non-DC nodes
will continue to call it when receiving the notification.
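
A sketch of that direction (illustrative names only, not the actual patch):
the DC bumps the counter as soon as the failed result arrives, so the
subsequent threshold check sees an accurate count:

```c
#include <assert.h>
#include <stdbool.h>

#define STONITH_MAX_ATTEMPTS 10

static int dc_fail_count = 0;

/* DC-side handling of a failed fencing result: increment first,
 * then check, so the logged count matches reality. Non-DC nodes
 * would keep incrementing when the notification arrives. */
static bool dc_failed_fence_attempt(void)
{
    dc_fail_count++;                              /* count this failure now */
    return dc_fail_count >= STONITH_MAX_ATTEMPTS; /* accurate threshold check */
}
```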

I'll put together a fix before the 1.1.17 release.

> 
> I confirmed that the following change produces the expected result.
> 
> # git diff
> diff --git a/fencing/remote.c b/fencing/remote.c
> index 4a47d49..3ff324e 100644
> --- a/fencing/remote.c
> +++ b/fencing/remote.c
> @@ -399,12 +399,12 @@ handle_local_reply_and_notify(remote_fencing_op_t * op, xmlNode * data, int rc)
>      reply = stonith_construct_reply(op->request, NULL, data, rc);
>      crm_xml_add(reply, F_STONITH_DELEGATE, op->delegate);
> 
> -    /* Send fencing OP reply to local client that initiated fencing */
> -    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
> -
>      /* bcast to all local clients that the fencing operation happend */
>      do_stonith_notify(0, T_STONITH_NOTIFY_FENCE, rc, notify_data);
> 
> +    /* Send fencing OP reply to local client that initiated fencing */
> +    do_local_reply(reply, op->client_id, op->call_options & st_opt_sync_call, FALSE);
> +
>      /* mark this op as having notify's already sent */
>      op->notify_sent = TRUE;
>      free_xml(reply);
> 
> Regards,
> Kazunori INOUE
> 
>> -----Original Message-----
>> From: Ken Gaillot [mailto:kgaillot at redhat.com]
>> Sent: Wednesday, May 17, 2017 11:09 PM
>> To: users at clusterlabs.org
>> Subject: Re: [ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate
>>
>> On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
>>> On 05/17/2017 11:28 AM, 井上 和徳 wrote:
>>>> Hi,
>>>> I'm testing Pacemaker-1.1.17-rc1.
>>>> The number of failures in the "Too many failures (10) to fence" log does not match the actual number of failures.
>>>
>>> Well, it kind of does: after 10 failures it doesn't try fencing again,
>>> so that is where the failure count stays ;-)
>>> Of course it still sees the need to fence but doesn't actually try.
>>>
>>> Regards,
>>> Klaus
>>
>> This feature can be a little confusing: it doesn't prevent all further
>> fence attempts of the target, just *immediate* fence attempts. Whenever
>> the next transition is started for some other reason (a configuration or
>> state change, cluster-recheck-interval, node failure, etc.), it will try
>> to fence again.
>>
>> Also, it only checks this threshold if it's aborting a transition
>> *because* of this fence failure. If it's aborting the transition for
>> some other reason, the number can go higher than the threshold. That's
>> what I'm guessing happened here.
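
The behaviour described above can be sketched like this (illustrative C, not
real Pacemaker code; names are hypothetical): the threshold only suppresses a
retry when the transition was aborted *because of* the fencing failure itself,
which is also how the count can climb past the threshold.

```c
#include <assert.h>

enum abort_reason {
    ABORT_STONITH_FAILED,  /* transition aborted by this fence failure */
    ABORT_OTHER            /* config change, recheck interval, node event, ... */
};

/* The threshold blocks only the immediate retry after a fence
 * failure; any other transition trigger schedules fencing again. */
static int retry_fencing(int failures, int max_attempts, enum abort_reason why)
{
    if (why == ABORT_STONITH_FAILED && failures >= max_attempts) {
        return 0; /* "Too many failures ... giving up" */
    }
    return 1;
}
```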
>>
>>>> After the 11th fencing failure, "Too many failures (10) to fence" is logged.
>>>> Incidentally, stonith-max-attempts has not been set, so the default of 10 applies.
>>>>
>>>> [root at x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
>>>> ## Requesting fencing: 1st time
>>>> May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.8415167d: No data available
>>>> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 2nd time
>>>> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.53d3592a: No data available
>>>> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 3rd time
>>>> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.9177cb76: No data available
>>>> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 4th time
>>>> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:56:05 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.946531cb: No data available
>>>> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 5th time
>>>> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:57:10 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.278b3c4b: No data available
>>>> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 6th time
>>>> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:58:14 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.7a49aebb: No data available
>>>> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 7th time
>>>> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 05:59:19 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.83421862: No data available
>>>> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 8th time
>>>> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:00:24 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.afd7ef98: No data available
>>>> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 9th time
>>>> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:01:28 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.3b033dbe: No data available
>>>> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 10th time
>>>> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:02:33 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.5447a345: No data available
>>>> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>> ## 11th time
>>>> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>>>> May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269 at rhel73-1.db50c21a: No data available
>>>> May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up
>>>> May 12 06:03:37 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>>>
>>>> Regards,
>>>> Kazunori INOUE



