[ClusterLabs] Antw: [EXT] Recovering from node failure

Ken Gaillot kgaillot at redhat.com
Fri Dec 11 11:10:39 EST 2020


On Fri, 2020-12-11 at 16:37 +0100, Gabriele Bulfon wrote:
> I found I can do this temporarily:
>  
> crm config property cib-bootstrap-options: no-quorum-policy=ignore
>  
> then once node 2 is up again:
>  
> crm config property cib-bootstrap-options: no-quorum-policy=stop
>  
> so that I make sure the nodes will not mount the pool in another
> strange situation.
>  
> Is there any better way? (such as ignoring quorum until everything is
> back to normal, then considering stop again)
>  
> Gabriele
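
To check what is actually in effect at any point, the standard
Pacemaker/corosync tools should work (paths and availability may differ
on illumos):

    # current value of the cluster property
    crm_attribute -t crm_config -n no-quorum-policy -G
    # quorum state as corosync sees it
    corosync-quorumtool -s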

When node 2 is known to be down and staying down, I'd probably disable
wait_for_all in corosync on node 1, start the cluster on node 1, then
re-enable wait_for_all on node 1 (either immediately, or right before
I'm ready to return node 2 to the cluster, depending on how long that
might be).
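
As a rough sketch, assuming a stock corosync votequorum two-node setup
(the path /etc/corosync/corosync.conf and the option names below are the
standard ones, adjust for your build), the quorum section on node 1
would look something like this while node 2 stays down:

    quorum {
        provider: corosync_votequorum
        # two_node: 1 implicitly turns wait_for_all on, so it has to be
        # disabled explicitly for as long as node 1 must start alone
        two_node: 1
        wait_for_all: 0
    }

Edit that before starting corosync/pacemaker on node 1, and put
wait_for_all back to 1 (or simply drop the line) before node 2 rejoins.
corosync-quorumtool -s on node 1 should then show the partition as
quorate.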

If a third host is available for a lightweight process, qdevice would
be another option.
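
Very roughly, and only as an illustration (the host address below is a
placeholder), that means running corosync-qnetd on the third host,
corosync-qdevice on both cluster nodes, and adding a device section to
the quorum configuration, something like:

    quorum {
        provider: corosync_votequorum
        device {
            votes: 1
            model: net
            net {
                host: 192.168.1.3   # third host running corosync-qnetd
                algorithm: ffsplit
            }
        }
    }

Note that two_node is normally dropped when a quorum device is in use.
With the ffsplit algorithm the qdevice gives its vote to one side of a
split, so a single surviving node can keep quorum without having to
ignore it.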

 
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>  
>  
> 
> 
> From: Gabriele Bulfon <gbulfon at sonicle.com>
> To: Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> Date: 11 December 2020 15:51:28 CET
> Subject: Re: [ClusterLabs] Antw: [EXT] Recovering from node failure
> 
> 
> >  
> > I cannot use "wait_for_all: 0", because this would automatically
> > move a powered-off node from UNCLEAN to OFFLINE and mount the ZFS
> > pool (a total risk!): I want to move it from UNCLEAN to OFFLINE
> > manually, when I know that the 2nd node is actually off!
> >  
> > Actually, with wait_for_all at its default (1) that was the case, so
> > node1 would wait for my intervention when booting while node2 is
> > down.
> > So what I think I need is some way to manually override quorum in
> > such a case (node 2 down for maintenance, node 1 rebooted), so I
> > would manually turn node2 from UNCLEAN to OFFLINE, manually override
> > quorum, and have the zpool mounted and the NFS IP up.
> >  
> > Any idea?
> >  
> >  
> >  
> > 
> > 
> > 
> > ----------------------------------------------------------------------
> > 
> > From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
> > To: users at clusterlabs.org
> > Date: 11 December 2020 11:35:44 CET
> > Subject: [ClusterLabs] Antw: [EXT] Recovering from node failure
> > 
> > > Hi!
> > > 
> > > Did you take care of the special "two node" settings (quorum, I
> > > mean)?
> > > When I use "crm_mon -1Arfj", I see something like
> > > " * Current DC: h19 (version 2.0.4+20200616.2deceaa3a-3.3.1-
> > > 2.0.4+20200616.2deceaa3a) - partition with quorum"
> > > 
> > > What do you see?
> > > 
> > > Regards,
> > > Ulrich
> > > 
> > > >>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 11.12.2020
> > > at 11:23 in
> > > message <350849824.6300.1607682209284 at www>:
> > > > Hi, I finally managed to get stonith working with IPMI in my
> > > > 2-node XStreamOS/illumos storage cluster.
> > > > I have NFS IPs and the shared storage zpool moving from one node
> > > > to the other, and stonith controlling IPMI to power a node off
> > > > when something is not clear.
> > > > 
> > > > What happens now is that if I shut down the 2nd node, I see the
> > > > OFFLINE status from node 1 and everything is up and running, and
> > > > this is ok:
> > > > 
> > > > Online: [ xstha1 ]
> > > > OFFLINE: [ xstha2 ]
> > > > Full list of resources:
> > > > xstha1_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> > > > xstha2_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> > > > xstha1-stonith (stonith:external/ipmi): Started xstha1
> > > > xstha2-stonith (stonith:external/ipmi): Started xstha1
> > > > zpool_data (ocf::heartbeat:ZFS): Started xstha1
> > > > But if I also reboot the 1st node, it comes up with node 2 in
> > > > the UNCLEAN state and nothing running, so I clear the state of
> > > > node 2, but the resources are not started:
> > > > 
> > > > Online: [ xstha1 ]
> > > > OFFLINE: [ xstha2 ]
> > > > Full list of resources:
> > > > xstha1_san0_IP (ocf::heartbeat:IPaddr): Stopped
> > > > xstha2_san0_IP (ocf::heartbeat:IPaddr): Stopped
> > > > xstha1-stonith (stonith:external/ipmi): Stopped
> > > > xstha2-stonith (stonith:external/ipmi): Stopped
> > > > zpool_data (ocf::heartbeat:ZFS): Stopped
> > > > I tried restarting zpool_data or other resources:
> > > > # crm resource start zpool_data
> > > > but nothing happens!
> > > > How can I recover from this state? Node2 needs to stay down, but
> > > > I want node1 to work.
> > > > Thanks!
> > > > Gabriele 
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > > 
-- 
Ken Gaillot <kgaillot at redhat.com>


