[ClusterLabs] Antw: [EXT] Stonith failing

Strahil Nikolov hunter86_bg at yahoo.com
Thu Jul 30 08:51:32 EDT 2020


SBD can use iSCSI (for example, the target can also be the quorum node), a disk partition or an LVM LV, so I guess it can also use a ZFS volume dedicated to SBD (10MB is enough).
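
For example, a rough sketch of carving such a device out of an existing
pool (names are placeholders, and the sbd tooling would of course need
to be available on your platform):

  # create a small dedicated zvol for SBD
  zfs create -V 10M rpool/sbd
  # write the SBD metadata to the device and verify it
  sbd -d /dev/zvol/dsk/rpool/sbd create
  sbd -d /dev/zvol/dsk/rpool/sbd dump
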
In your case IPMI is quite suitable.
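
A typical IPMI-based stonith resource, sketched in pcs syntax (addresses
and credentials are placeholders, and your cluster shell may differ):

  pcs stonith create fence-node1 fence_ipmilan \
      ip=10.0.0.101 username=admin password=secret \
      lanplus=1 pcmk_host_list=node1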

About the power fencing when persistent reservations are removed -> it's just a script started by the watchdog.service on the node itself. It should be usable on all Linuxes and many UNIX-like OSes.
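
For reference, the watchdog side of sbd is configured with a handful of
variables; a minimal sketch (file location, device paths and timeouts
are illustrative only):

  # /etc/sysconfig/sbd (location varies by distribution)
  SBD_DEVICE="/dev/zvol/dsk/rpool/sbd"
  SBD_WATCHDOG_DEV="/dev/watchdog"
  SBD_WATCHDOG_TIMEOUT="5"
  SBD_STARTMODE="always"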

Best Regards,
Strahil Nikolov

On 30 July 2020 12:05:39 GMT+03:00, Gabriele Bulfon <gbulfon at sonicle.com> wrote:
>Reading about sbd in the SuSE documentation, I saw that it requires a
>special block device to write its information to; I don't think this is
>possible here.
> 
>It's a dual-node ZFS storage system running our own XStreamOS/illumos
>distribution, and here we're trying to add HA capabilities.
>We can move IPs, ZFS pools and COMSTAR/iSCSI/FC, and are now looking
>for a stable way to manage stonith.
> 
>The hardware system is this:
> 
>https://www.supermicro.com/products/system/1u/1029/SYS-1029TP-DC0R.cfm
> 
>and it features a shared SAS3 backplane, so both nodes can see all the
>disks concurrently.
> 
>Gabriele
> 
> 
>Sonicle S.r.l. : http://www.sonicle.com
>Music: http://www.gabrielebulfon.com
>Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
>From: Reid Wahl
>To: Cluster Labs - All topics related to open-source clustering welcomed
>Date: 30 July 2020 6:38:58 CEST
>Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing
>I don't know of a stonith method that acts upon a filesystem directly.
>You'd generally want to act upon the power state of the node or upon
>the underlying shared storage.
> 
>What kind of hardware or virtualization platform are these systems
>running on? If there is a hardware watchdog timer, then sbd is
>possible. The fence_sbd agent (poison-pill fencing via block device)
>requires shared block storage, but sbd itself only requires a hardware
>watchdog timer.
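>
>As a rough sketch, watchdog-only ("diskless") sbd on a pcs-based Linux
>distribution can be enabled like this (the cluster must be stopped
>first; equivalents on other platforms will differ):
>
>  pcs stonith sbd enable --watchdog=/dev/watchdog SBD_WATCHDOG_TIMEOUT=5
>  pcs property set stonith-watchdog-timeout=10s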
> 
>Additionally, there may be an existing fence agent that can connect to
>the controller you mentioned. What kind of controller is it?
>On Wed, Jul 29, 2020 at 5:24 AM Gabriele Bulfon <gbulfon at sonicle.com>
>wrote:
>Thanks a lot for the extensive explanation!
>Any idea about a ZFS stonith?
> 
>Gabriele
> 
> 
>Sonicle S.r.l. : http://www.sonicle.com
>Music: http://www.gabrielebulfon.com
>Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
>From: Reid Wahl <nwahl at redhat.com>
>To: Cluster Labs - All topics related to open-source clustering
>welcomed <users at clusterlabs.org>
>Date: 29 July 2020 11:39:35 CEST
>Subject: Re: [ClusterLabs] Antw: [EXT] Stonith failing
>"As it stated in the comments, we don't want to halt or boot via ssh,
>only reboot."
> 
>Generally speaking, a stonith reboot action consists of the following
>basic sequence of events:
> 1. Execute the fence agent with the "off" action.
> 2. Poll the power status of the fenced node until it is powered off.
> 3. Execute the fence agent with the "on" action.
> 4. Poll the power status of the fenced node until it is powered on.
>So a custom fence agent that supports reboots actually needs to support
>both the off and on actions.
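>
>A bare-bones outline of such an agent (Pacemaker's fencer hands the
>options to the agent as key=value lines on stdin; power_off, power_on,
>power_status and print_metadata_xml are placeholders for whatever
>out-of-band mechanism and metadata you actually provide):
>
>  #!/bin/sh
>  # read the options passed on stdin
>  while read line; do
>      case "$line" in
>          action=*) action=${line#action=} ;;
>          port=*)   node=${line#port=} ;;
>      esac
>  done
>  case "$action" in
>      off)            power_off "$node" ;;
>      on)             power_on "$node" ;;
>      reboot)         power_off "$node" && power_on "$node" ;;
>      status|monitor) power_status "$node" ;;
>      metadata)       print_metadata_xml ;;
>      *)              exit 1 ;;
>  esac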
> 
> 
>As Andrei noted, ssh is **not** a reliable method by which to ensure a
>node gets rebooted or stops using cluster-managed resources. You can't
>depend on the ability to SSH to an unhealthy node that needs to be
>fenced.
> 
>The only way to guarantee that an unhealthy or unresponsive node stops
>all access to shared resources is to power off or reboot the node. (In
>the case of resources that rely on shared storage, I/O fencing instead
>of power fencing can also work, but that's not ideal.)
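>
>For the I/O-fencing variant, fence_scsi with SCSI-3 persistent
>reservations is the usual Linux example; a sketch with placeholder
>device and host names:
>
>  pcs stonith create scsi-fence fence_scsi \
>      devices=/dev/disk/by-id/example-shared-lun \
>      pcmk_host_list="node1 node2" meta provides=unfencing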
> 
>As others have said, SBD is a great option. Use it if you can. There
>are also power fencing methods (one example is fence_ipmilan, but the
>options available depend on your hardware or virt platform) that are
>reliable under most circumstances.
> 
>You said that when you stop corosync on node 2, Pacemaker tries to
>fence node 2. There are a couple of possible reasons for that. One
>possibility is that you stopped or killed corosync without stopping
>Pacemaker first. (If you use pcs, then try `pcs cluster stop`.) Another
>possibility is that resources failed to stop during cluster shutdown on
>node 2, causing node 2 to be fenced.
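>
>For example, a clean way to take a node down (pcs/systemd syntax shown;
>the illumos equivalent depends on your service manager):
>
>  pcs cluster stop node2
>  # or, directly on the node, stop pacemaker before corosync:
>  systemctl stop pacemaker
>  systemctl stop corosync
>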
>On Wed, Jul 29, 2020 at 12:47 AM Andrei Borzenkov <arvidjaar at gmail.com>
>wrote:
> 
>On Wed, Jul 29, 2020 at 9:01 AM Gabriele Bulfon <gbulfon at sonicle.com>
>wrote:
>That one was taken from a specific implementation on Solaris 11.
>The situation is a dual-node server with a shared storage controller:
>both nodes see the same disks concurrently.
>Here we must be sure that the two nodes are not going to import/mount
>the same zpool at the same time, or we will encounter data corruption.
> 
>An ssh-based "stonith" cannot guarantee that.
> 
>Node 1 will be preferred for pool 1 and node 2 for pool 2; only in case
>one of the nodes goes down or is taken offline should the resources be
>freed first by the leaving node and then taken over by the other node.
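>
>For illustration, that preference could be expressed with location
>constraints, assuming a ZFS pool resource agent is available (such as
>ocf:heartbeat:ZFS or a custom one; pcs syntax and names are only
>placeholders):
>
>  pcs resource create pool1 ocf:heartbeat:ZFS pool=pool1
>  pcs resource create pool2 ocf:heartbeat:ZFS pool=pool2
>  pcs constraint location pool1 prefers node1=100
>  pcs constraint location pool2 prefers node2=100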
> 
>Would you suggest one of the available stonith in this case?
> 
> 
>IPMI, managed PDU, SBD ...
>In practice, the only stonith method that still works in the case of a
>complete node outage, including loss of power, is SBD.
>--
>Regards,
>Reid Wahl, RHCA
>Software Maintenance Engineer, Red Hat
>CEE - Platform Support Delivery - ClusterHA


More information about the Users mailing list