[ClusterLabs] Antw: Re: Antw: [EXT] Stonith failing

Reid Wahl nwahl at redhat.com
Wed Jul 29 06:03:37 EDT 2020


On Wed, Jul 29, 2020 at 2:48 AM Ulrich Windl <
Ulrich.Windl at rz.uni-regensburg.de> wrote:

> >>> Reid Wahl <nwahl at redhat.com> wrote on 29.07.2020 at 11:39 in message
> <CAPiuu98aDaGzDunKaSR3rchRC+O9MH8UqaTsn36q633nDXadMA at mail.gmail.com>:
> > "As it stated in the comments, we don't want to halt or boot via ssh,
> only
> > reboot."
> >
> > Generally speaking, a stonith reboot action consists of the following
> basic
> > sequence of events:
> >
> >    1. Execute the fence agent with the "off" action.
> >    2. Poll the power status of the fenced node until it is powered off.
> >    3. Execute the fence agent with the "on" action.
> >    4. Poll the power status of the fenced node until it is powered on.
> >
> > So a custom fence agent that supports reboots actually needs to support
> > off and on actions.
>
> Are you sure? SBD can do an "off" action, but when the node is off, it
> cannot perform an "on" action. So you can either use "off" and the node
> will remain off, or use "reboot" and the node will be reset (and will
> hopefully come up again).
>

I'm referring to conventional power fencing agents. Sorry for not
clarifying. Conventional power fencing (e.g., fence_ipmilan and
fence_vmware_soap) is most of what I see deployed on a daily basis.
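
As a rough illustration of the off/on/reboot sequence above, here is a
minimal sketch of a custom fence agent. It is not a complete or supported
agent: power_off_node, power_on_node, and power_status_node are placeholder
helpers for whatever out-of-band power interface your platform really has
(IPMI, a managed PDU, a hypervisor API, ...), and a real agent must also
emit proper XML metadata and validate its options. The fencer passes
parameters as key=value lines on stdin, including the requested action.

#!/bin/sh
# Minimal sketch of a custom power fence agent (not production code).
# power_off_node / power_on_node / power_status_node are hypothetical
# helpers standing in for the real out-of-band power interface.

action=""
node=""

# Parameters arrive as key=value lines on stdin.
while read -r line; do
    key=${line%%=*}
    val=${line#*=}
    case "$key" in
        action)              action=$val ;;
        plug|port|nodename)  node=$val ;;
    esac
done

wait_for_state() {
    # Poll the node's power state until it matches $1 (on|off), ~60s max.
    want=$1
    for i in $(seq 1 30); do
        [ "$(power_status_node "$node")" = "$want" ] && return 0
        sleep 2
    done
    return 1
}

case "$action" in
    off)
        power_off_node "$node" && wait_for_state off
        ;;
    on)
        power_on_node "$node" && wait_for_state on
        ;;
    reboot)
        # Reboot = off, verify off, then on, verify on.
        power_off_node "$node" && wait_for_state off || exit 1
        power_on_node "$node" && wait_for_state on
        ;;
    status|monitor)
        # Simplified: succeed only if the node reports power on.
        [ "$(power_status_node "$node")" = "on" ]
        ;;
    metadata)
        # A real agent must print its XML metadata here; omitted in this sketch.
        exit 0
        ;;
    *)
        exit 1
        ;;
esac
exit $?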


> >
> >
> > As Andrei noted, ssh is **not** a reliable method by which to ensure a
> > node gets rebooted or stops using cluster-managed resources. You can't
> > depend on the ability to SSH to an unhealthy node that needs to be fenced.
> >
> > The only way to guarantee that an unhealthy or unresponsive node stops
> > all access to shared resources is to power off or reboot the node. (In
> > the case of resources that rely on shared storage, I/O fencing instead of
> > power fencing can also work, but that's not ideal.)
> >
> > As others have said, SBD is a great option. Use it if you can. There are
> > also power fencing methods (one example is fence_ipmilan, but the options
> > available depend on your hardware or virt platform) that are reliable
> > under most circumstances.
> >
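
For concreteness, a hypothetical pcs configuration for IPMI-based fencing
might look like the following; the device names, addresses, credentials, and
node names are placeholders, and the exact fence_ipmilan parameters depend on
your BMC (see `pcs stonith describe fence_ipmilan`):

# One fence device per node, pointed at that node's BMC.
pcs stonith create fence-node1 fence_ipmilan \
    ip=192.168.100.11 username=admin password=secret \
    lanplus=1 pcmk_host_list=node1

pcs stonith create fence-node2 fence_ipmilan \
    ip=192.168.100.12 username=admin password=secret \
    lanplus=1 pcmk_host_list=node2
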
> > You said that when you stop corosync on node 2, Pacemaker tries to fence
> > node 2. There are a couple of possible reasons for that. One possibility
> > is that you stopped or killed corosync without stopping Pacemaker first.
> > (If you use pcs, then try `pcs cluster stop`.) Another possibility is
> > that resources failed to stop during cluster shutdown on node 2, causing
> > node 2 to be fenced.
> >
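
To make that concrete (hypothetical commands, assuming pcs): take a node out
of the cluster by stopping the cluster services through pcs rather than
killing corosync out from under Pacemaker.

# Clean shutdown of cluster services on node2: resources are stopped or
# moved first, and no fencing should result.
pcs cluster stop node2

# By contrast, stopping or killing corosync while pacemaker is still
# running looks like a node failure to the rest of the cluster and will
# trigger fencing, e.g.:
#   systemctl stop corosync

Afterwards, `pcs status` on a surviving node should show node2 as OFFLINE.
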
> > On Wed, Jul 29, 2020 at 12:47 AM Andrei Borzenkov <arvidjaar at gmail.com>
> > wrote:
> >
> >>
> >>
> >> On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon <gbulfon at sonicle.com>
> >> wrote:
> >>
> >>> That one was taken from a specific implementation on Solaris 11.
> >>> The situation is a dual-node server with a shared storage controller:
> >>> both nodes see the same disks concurrently.
> >>> Here we must be sure that the two nodes are not going to import/mount
> >>> the same zpool at the same time, or we will encounter data corruption:
> >>>
> >>
> >> An ssh-based "stonith" cannot guarantee it.
> >>
> >>
> >>
> >>> Node 1 will be preferred for pool 1 and node 2 for pool 2; only if one
> >>> of the nodes goes down or is taken offline should the resources first
> >>> be freed by the leaving node and then taken over by the other node.
> >>>
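
The preferred-node behaviour described above would normally be expressed in
the cluster configuration rather than in the fence agent. A rough sketch,
assuming pcs and hypothetical resource names pool1/pool2 on nodes
node1/node2:

# Prefer one pool per node.
pcs constraint location pool1 prefers node1=100
pcs constraint location pool2 prefers node2=100

# Optional: keep a pool where it is after a failover instead of moving it
# back as soon as its preferred node returns.
pcs resource defaults resource-stickiness=200

Note that these constraints only express placement preferences; they do not
by themselves prevent concurrent imports, which is exactly why working
stonith is still required.
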
> >>> Would you suggest one of the available stonith in this case?
> >>>
> >>>
> >>
> >> IPMI, managed PDU, SBD ...
> >>
> >> In practice, the only stonith method that works in case of complete node
> >> outage including any power supply is SBD.
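
As a very rough sketch of the disk-based ("poison pill") SBD setup on
platforms where sbd is available, with a placeholder device path and
assuming pcs:

# Initialize a small shared LUN as an SBD device (this overwrites it).
sbd -d /dev/disk/by-id/shared-sbd-lun create

# Point the sbd daemon at that device (SBD_DEVICE in the sbd configuration
# file), enable it on both nodes, then define the fence device:
pcs stonith create fence-sbd fence_sbd devices=/dev/disk/by-id/shared-sbd-lun

Whether sbd is available at all on Solaris/illumos is a separate question;
the above only illustrates what the setup looks like where it is supported.
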
> >
> >
> > --
> > Regards,
> >
> > Reid Wahl, RHCA
> > Software Maintenance Engineer, Red Hat
> > CEE - Platform Support Delivery - ClusterHA
>
>
>
>

-- 
Regards,

Reid Wahl, RHCA
Software Maintenance Engineer, Red Hat
CEE - Platform Support Delivery - ClusterHA