[ClusterLabs] Pacemaker's "stonith too many failures" log is not accurate

Ken Gaillot kgaillot at redhat.com
Wed May 17 16:09:08 CEST 2017


On 05/17/2017 04:56 AM, Klaus Wenninger wrote:
> On 05/17/2017 11:28 AM, 井上 和徳 wrote:
>> Hi,
>> I'm testing Pacemaker-1.1.17-rc1.
>> The number of failures in the "Too many failures (10) to fence" log message does not match the number of actual failures.
> 
> Well, it kind of does: after 10 failures it doesn't try fencing again,
> so that is what the failure count stays at ;-)
> Of course, it still sees the need to fence but doesn't actually try.
> 
> Regards,
> Klaus

This feature can be a little confusing: it doesn't prevent all further
fence attempts of the target, just *immediate* fence attempts. Whenever
the next transition is started for some other reason (a configuration or
state change, cluster-recheck-interval, node failure, etc.), it will try
to fence again.

Also, it only checks this threshold if it's aborting a transition
*because* of this fence failure. If it's aborting the transition for
some other reason, the number can go higher than the threshold. That's
what I'm guessing happened here.
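
As a rough illustration of that second point, here is a toy model in Python.
It is not Pacemaker's code: the function name, the boolean list, and the
">=" comparison are all assumptions made up for this sketch. Every failed
attempt is counted, but the threshold is only consulted when the transition
is being aborted because of that fence failure.

```python
def failures_until_give_up(abort_is_fence_failure, max_attempts=10):
    """Toy model: return the failure count at which immediate retries stop.

    abort_is_fence_failure is a hypothetical list of booleans, one per
    failed fence attempt, saying whether the transition abort that followed
    was attributed to that fence failure (True) or to something else (False).
    Returns None if the threshold check never fires.
    """
    count = 0
    for because_of_fence in abort_is_fence_failure:
        count += 1  # every failed attempt is counted
        # Assumption: the threshold is checked only on this code path.
        if because_of_fence and count >= max_attempts:
            return count
    return None


# Ten failures, each one the reason for the abort: gives up at 10.
print(failures_until_give_up([True] * 10))
# The 10th abort is attributed to something else: gives up at 11 instead.
print(failures_until_give_up([True] * 9 + [False, True]))
```

In this model the count at which it gives up can exceed the threshold by
however many aborts were attributed to another cause, which is the kind of
off-by-one I'm guessing at above.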

>> After the 11th fence failure, "Too many failures (10) to fence" is output.
>> Incidentally, stonith-max-attempts has not been set, so it is 10 by default.
>>
>> [root@x3650f log]# egrep "Requesting fencing|error: Operation reboot|Stonith failed|Too many failures"
>> ## Requesting fencing: 1st time
>> May 12 05:51:47 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:52:52 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.8415167d: No data available
>> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>> ## 2nd time
>> May 12 05:52:52 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:53:56 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.53d3592a: No data available
>> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>> ## 3rd time
>> May 12 05:53:56 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:55:01 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.9177cb76: No data available
>> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>> ## 4th time
>> May 12 05:55:01 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:56:05 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.946531cb: No data available
>> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>> ## 5th time
>> May 12 05:56:05 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:57:10 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.278b3c4b: No data available
>> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>> ## 6th time
>> May 12 05:57:10 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:58:14 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.7a49aebb: No data available
>> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>> ## 7th time
>> May 12 05:58:14 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 05:59:19 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.83421862: No data available
>> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>> ## 8th time
>> May 12 05:59:19 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 06:00:24 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.afd7ef98: No data available
>> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>> ## 9th time
>> May 12 06:00:24 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 06:01:28 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.3b033dbe: No data available
>> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>> ## 10th time
>> May 12 06:01:28 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 06:02:33 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.5447a345: No data available
>> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>> ## 11th time
>> May 12 06:02:33 rhel73-1 crmd[5269]:  notice: Requesting fencing (reboot) of node rhel73-2
>> May 12 06:03:37 rhel73-1 stonith-ng[5265]:   error: Operation reboot of rhel73-2 by rhel73-1 for crmd.5269@rhel73-1.db50c21a: No data available
>> May 12 06:03:37 rhel73-1 crmd[5269]: warning: Too many failures (10) to fence rhel73-2, giving up
>> May 12 06:03:37 rhel73-1 crmd[5269]:  notice: Transition aborted: Stonith failed
>>
>> Regards,
>> Kazunori INOUE
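
If it helps to check the count without numbering the attempts by hand, a
small script can tally the stonith-ng failure lines per target and compare
the total with the number printed in the "Too many failures" warning. This
is only a sketch: the regular expressions are written against the log lines
quoted above and may need adjusting for other versions or syslog formats,
and the function name is made up.

```python
import re
from collections import Counter

# Patterns modeled on the log lines quoted in this thread (an assumption:
# other Pacemaker versions may word these messages differently).
FAILED = re.compile(r"stonith-ng\[\d+\]:\s+error: Operation (\w+) of (\S+) by ")
GAVE_UP = re.compile(
    r"crmd\[\d+\]:\s+warning: Too many failures \((\d+)\) to fence ([^\s,]+), giving up"
)


def tally(lines):
    """Count failed fence operations per target.

    Returns (failures, reports), where failures is a Counter keyed by target
    node and reports is a list of (target, reported, counted_so_far) tuples,
    one per "Too many failures" warning seen.
    """
    failures = Counter()
    reports = []
    for line in lines:
        match = FAILED.search(line)
        if match:
            failures[match.group(2)] += 1
            continue
        match = GAVE_UP.search(line)
        if match:
            target = match.group(2)
            reports.append((target, int(match.group(1)), failures[target]))
    return failures, reports
```

Calling tally() on an open log file (the path depends on your syslog
configuration) and printing the reports should show, for the output above,
a reported value of 10 against 11 counted failures for rhel73-2.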
