[ClusterLabs] Antw: [EXT] Recovering from node failure
Ken Gaillot
kgaillot at redhat.com
Fri Dec 11 11:10:39 EST 2020
On Fri, 2020-12-11 at 16:37 +0100, Gabriele Bulfon wrote:
> I found I can do this temporarily:
>
> crm config property cib-bootstrap-options: no-quorum-policy=ignore
>
> then once node 2 is up again:
>
> crm config property cib-bootstrap-options: no-quorum-policy=stop
>
> so that I make sure nodes will not mount in another strange
> situation.
>
> Is there any better way? (such as ignore until everything is back to
> normal, then consider stop again)
>
> Gabriele
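
(As an aside, the usual crmsh spelling of the property toggling above
would be roughly the following; treat it as a sketch, since the exact
syntax can vary between crmsh versions:

    # while node 2 is known to be down, let node 1 run without quorum
    crm configure property no-quorum-policy=ignore

    # once node 2 has rejoined the cluster, go back to the default
    crm configure property no-quorum-policy=stop
)
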
When node 2 is known to be down and staying down, I'd probably disable
wait_for_all in corosync on node 1, start the cluster on node 1, then
re-enable wait_for_all on node 1 (either immediately, or right before
I'm ready to return node 2 to the cluster, depending on how long that
might be).
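
For reference, a minimal sketch of the relevant part of corosync.conf on
node 1 (quorum section only; with two_node set, wait_for_all is implied,
so it has to be disabled explicitly and restored afterwards):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # temporarily 0 while node 2 is known to be down;
        # set back to 1 (or drop the line) before node 2 returns
        wait_for_all: 0
    }
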
If a third host is available for a lightweight process, qdevice would
be another option.
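
Roughly, that means running corosync-qnetd on the third host and
corosync-qdevice on both cluster nodes, with a quorum section along
these lines (the host name is a placeholder, and two_node is normally
dropped when a quorum device is in use):

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            net {
                # third host running corosync-qnetd (placeholder)
                host: qnetd.example.com
                algorithm: ffsplit
            }
        }
    }
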
> Sonicle S.r.l. : http://www.sonicle.com
> Music: http://www.gabrielebulfon.com
> eXoplanets : https://gabrielebulfon.bandcamp.com/album/exoplanets
>
>
>
>
> From: Gabriele Bulfon <gbulfon at sonicle.com>
> To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
> Date: 11 December 2020 15:51:28 CET
> Subject: Re: [ClusterLabs] Antw: [EXT] Recovering from node failure
>
>
> >
> > I cannot "use wait_for_all: 0", cause this would move automatically
> > a powered off node from UNCLEAN to OFFLINE and mount the ZFS pool
> > (total risk!): I want to manually move from UNCLEAN to OFFLINE,
> > when I know that 2nd node is actually off!
> >
> > Actually, with wait_for_all at its default (1) that was the case, so
> > node1 would wait for my intervention when booting while node2 is
> > down.
> > So what I think I need is some way to manually override quorum in
> > such a case (node 2 down for maintenance, node 1 rebooted), so I
> > would manually turn node2 from UNCLEAN to OFFLINE, manually override
> > quorum, and have the zpool mounted and the NFS IP up.
> >
> > Any idea?
> >
> >
> > ----------------------------------------------------------------------
> >
> > From: Ulrich Windl <Ulrich.Windl at rz.uni-regensburg.de>
> > To: users at clusterlabs.org
> > Date: 11 December 2020 11:35:44 CET
> > Subject: [ClusterLabs] Antw: [EXT] Recovering from node failure
> >
> > > Hi!
> > >
> > > Did you take care of the special "two node" settings (quorum, I
> > > mean)?
> > > When I use "crm_mon -1Arfj", I see something like
> > > " * Current DC: h19 (version 2.0.4+20200616.2deceaa3a-3.3.1-
> > > 2.0.4+20200616.2deceaa3a) - partition with quorum"
> > >
> > > What do you see?
> > >
> > > Regards,
> > > Ulrich
> > >
> > > >>> Gabriele Bulfon <gbulfon at sonicle.com> wrote on 11.12.2020 at
> > > 11:23 in message <350849824.6300.1607682209284 at www>:
> > > > Hi, I finally managed to get stonith with IPMI working in my
> > > > 2-node XStreamOS/illumos storage cluster.
> > > > I have the NFS IPs and the shared storage zpool moving from one
> > > > node to the other, and stonith controlling IPMI to power a node
> > > > off when something is not clear.
> > > >
> > > > What happens now is that if I shut down the 2nd node, I see the
> > > > OFFLINE status from node 1 and everything is up and running,
> > > > and this is ok:
> > > >
> > > > Online: [ xstha1 ]
> > > > OFFLINE: [ xstha2 ]
> > > > Full list of resources:
> > > > xstha1_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> > > > xstha2_san0_IP (ocf::heartbeat:IPaddr): Started xstha1
> > > > xstha1-stonith (stonith:external/ipmi): Started xstha1
> > > > xstha2-stonith (stonith:external/ipmi): Started xstha1
> > > > zpool_data (ocf::heartbeat:ZFS): Started xstha1
> > > > But if I also reboot the 1st node, it starts with node 2 in the
> > > > UNCLEAN state and nothing is running, so I clear the state of
> > > > node 2, but the resources are not started:
> > > >
> > > > Online: [ xstha1 ]
> > > > OFFLINE: [ xstha2 ]
> > > > Full list of resources:
> > > > xstha1_san0_IP (ocf::heartbeat:IPaddr): Stopped
> > > > xstha2_san0_IP (ocf::heartbeat:IPaddr): Stopped
> > > > xstha1-stonith (stonith:external/ipmi): Stopped
> > > > xstha2-stonith (stonith:external/ipmi): Stopped
> > > > zpool_data (ocf::heartbeat:ZFS): Stopped
> > > > I tried restarting zpool_data or other resources:
> > > > # crm resource start zpool_data
> > > > but nothing happens!
> > > > How can I recover from this state? Node2 needs to stay down,
> > > > but I want node1 to work.
> > > > Thanks!
> > > > Gabriele
--
Ken Gaillot <kgaillot at redhat.com>
More information about the Users mailing list