[ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing

Wed Jul 29 05:48:21 EDT 2020

>>> Reid Wahl <nwahl at redhat.com> wrote on 29.07.2020 at 11:39 in message
<CAPiuu98aDaGzDunKaSR3rchRC+O9MH8UqaTsn36q633nDXadMA at mail.gmail.com>:
> "As it stated in the comments, we don't want to halt or boot via ssh, only
> reboot."
> 
> Generally speaking, a stonith reboot action consists of the following basic
> sequence of events:
> 
>    1. Execute the fence agent with the "off" action.
>    2. Poll the power status of the fenced node until it is powered off.
>    3. Execute the fence agent with the "on" action.
>    4. Poll the power status of the fenced node until it is powered on.
> 
> So a custom fence agent that supports reboots actually needs to support
> off and on actions.

Are you sure? SBD can perform an "off" action, but once the node is off, it cannot perform an "on" action. So either you use "off" and the node remains off, or you use "reboot" and the node is reset (and, hopefully, comes up again).
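
For illustration, the off/poll/on/poll sequence quoted above can be sketched in Python. This is only a toy model under stated assumptions: FakePowerDevice, wait_for, and reboot_via_off_on are hypothetical names, not part of Pacemaker or of any real fence agent, and the timeout values are arbitrary.

```python
import time


class FakePowerDevice:
    """Hypothetical stand-in for a power fencing device (not a real agent)."""

    def __init__(self):
        self.powered = True
        self.log = []  # records the actions that were requested

    def off(self):
        self.log.append("off")
        self.powered = False

    def on(self):
        self.log.append("on")
        self.powered = True

    def status(self):
        return "on" if self.powered else "off"


def wait_for(device, wanted, timeout=5.0, interval=0.01):
    """Poll the device until it reports the wanted power state or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if device.status() == wanted:
            return True
        time.sleep(interval)
    return False


def reboot_via_off_on(device):
    """Reboot modelled as: off, poll until off, on, poll until on."""
    device.off()
    if not wait_for(device, "off"):
        return False  # power-off was never confirmed, so the reboot fails
    device.on()
    return wait_for(device, "on")
```

A device that cannot perform "on" once the node is off, as in the SBD case, does not fit this model and would need a reboot action of its own.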

> 
> 
> As Andrei noted, ssh is **not** a reliable method by which to ensure a node
> gets rebooted or stops using cluster-managed resources. You can't depend on
> the ability to SSH to an unhealthy node that needs to be fenced.
> 
> The only way to guarantee that an unhealthy or unresponsive node stops all
> access to shared resources is to power off or reboot the node. (In the case
> of resources that rely on shared storage, I/O fencing instead of power
> fencing can also work, but that's not ideal.)
> 
> As others have said, SBD is a great option. Use it if you can. There are
> also power fencing methods (one example is fence_ipmilan, but the options
> available depend on your hardware or virt platform) that are reliable under
> most circumstances.
> 
> You said that when you stop corosync on node 2, Pacemaker tries to fence
> node 2. There are a couple of possible reasons for that. One possibility is
> that you stopped or killed corosync without stopping Pacemaker first. (If
> you use pcs, then try `pcs cluster stop`.) Another possibility is that
> resources failed to stop during cluster shutdown on node 2, causing node 2
> to be fenced.
> 
> On Wed, Jul 29, 2020 at 12:47 AM Andrei Borzenkov <arvidjaar at gmail.com>
> wrote:
> 
>>
>>
>> On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon <gbulfon at sonicle.com>
>> wrote:
>>
>>> That one was taken from a specific implementation on Solaris 11.
>>> The situation is a dual-node server with a shared storage controller: both
>>> nodes see the same disks concurrently.
>>> Here we must be sure that the two nodes are not going to import/mount the
>>> same zpool at the same time, or we will encounter data corruption:
>>>
>>
>> ssh based "stonith" cannot guarantee it.
>>
>>
>>
>>> node 1 will be preferred for pool 1, node 2 for pool 2; only in case one
>>> of the nodes goes down or is taken offline should the resources first be
>>> freed by the leaving node and taken by the other node.
>>>
>>> Would you suggest one of the available stonith methods in this case?
>>>
>>>
>>
>> IPMI, managed PDU, SBD ...
>>
>> In practice, the only stonith method that works in case of complete node
>> outage including any power supply is SBD.
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> ClusterLabs home: https://www.clusterlabs.org/ 
>>
> 
> 
> -- 
> Regards,
> 
> Reid Wahl, RHCA
> Software Maintenance Engineer, Red Hat
> CEE - Platform Support Delivery - ClusterHA