[ClusterLabs] Antw: Re: Antw: Re: Antw: Re: Antw: RES: Pacemaker and OCFS2 on stand alone mode

Wed Jul 13 14:07:28 UTC 2016

On 07/13/2016 03:10 AM, Ulrich Windl wrote:
>>>> Ken Gaillot <kgaillot at redhat.com> wrote on 12.07.2016 at 21:19 in message
> <578542BF.9010303 at redhat.com>:
>> On 07/12/2016 01:16 AM, Ulrich Windl wrote:
> 
> [...]
>>> What I mean is: there is no "success status" for STONITH; it is assumed that
>>> the node will be down after issuing a successful stonith command. You are
>>> claiming your stonith command was not logging any error, so the cluster will
>>> assume STONITH was successful after a timeout.
>>
>> Fence agents do return success/failure; the cluster considers a timeout
>> to be a failure. The only time the cluster assumes a successful fence is
>> when sbd-based watchdog is in use.
> 
> Hi!
> 
> Sorry, but I don't see the difference: If SBD delivers a command successfully, there is no guarantee that the victim node actually executes the command and resets.
> If you use any other fencing command (like submitting some command to an external device), the situation is no different: successfully submitting the command does not mean the STONITH will succeed in every case (you could even turn off power on the wrong PDU, which is still a "success" from the cluster's perspective).
> [...]
> 
> What I really wanted to say is:
> If the fencing command logged an error, try to fix it; if it did not, try to find out why fencing did not work.
> 
> Regards,
> Ulrich

Yes, I understand your point now, and agree completely.

The cluster can only respond to the status code (or timeout) it receives
from the fence agent. There may be problems beyond that point (in the
fence agent and/or the device itself) that result in success being
returned incorrectly, and that must be investigated separately.
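
To illustrate that decision rule, here is a minimal sketch in Python. It is not Pacemaker's actual code: the function name run_fence_action and the idea of invoking the agent as a plain subprocess are assumptions made for the example. It only shows that the caller sees an exit status or a timeout, and nothing else.

```python
import subprocess
import sys


def run_fence_action(argv, timeout_s):
    """Run a fence-agent-like command and classify the outcome.

    Hypothetical helper for illustration only. The caller can see
    nothing but the exit status or a timeout; it cannot verify that
    the target node is really down.
    """
    try:
        result = subprocess.run(
            argv,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # No answer in time is treated as a failed fencing attempt.
        return "failure"
    # Exit status 0 is taken as success, anything else as failure.
    return "success" if result.returncode == 0 else "failure"


if __name__ == "__main__":
    # Stand-in "agent" that exits 0 without doing anything: the caller
    # still reports success, which is exactly the case that has to be
    # investigated in the agent or device rather than in the cluster.
    print(run_fence_action([sys.executable, "-c", "raise SystemExit(0)"], 30))
```

An agent that exits 0 after switching the wrong PDU outlet would be classified as a success here too, so a fence that "succeeded" but left the node running has to be debugged at the agent or device level.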