[ClusterLabs] shutdown and restart of complete cluster due to power outage with UPS

Ken Gaillot kgaillot at redhat.com
Tue Jan 22 15:24:40 EST 2019


On Tue, 2019-01-22 at 20:35 +0300, Andrei Borzenkov wrote:
> On 22.01.2019 20:00, Ken Gaillot wrote:
> > On Tue, 2019-01-22 at 16:52 +0100, Lentes, Bernd wrote:
> > > Hi,
> > > 
> > > we have a new UPS which has enough charge to supply our 2-node
> > > cluster and its periphery (SAN, switches ...) for a reasonable
> > > time.
> > > I'm currently thinking about the shutdown and restart procedure
> > > for the complete cluster when power is lost and does not come
> > > back soon. The cluster is then powered by the UPS, but that does
> > > not last indefinitely, so I have to shut down the complete
> > > cluster.
> > > I have the possibility to run scripts on each node which are
> > > triggered by the UPS.
> > > 
> > > My shutdown procedure is:
> > > crm -w node standby node1
> > >   resources are migrated to node2
> > > systemctl stop pacemaker
> > >   also stops corosync
> > >   node is not fenced! (because of standby?)
> > 
> > Clean shutdowns don't get fenced. As long as the exiting node can
> > tell
> > the rest of the cluster that it's leaving, everything can be
> > coordinated gracefully.
> > 
> > > systemctl poweroff
> > >   clean shutdown of node1
> > > 
> > > crm -w node standby node2
> > >   clean stop of resources
> > > systemctl stop pacemaker
> > > systemctl poweroff
> > > 
> > > The scripts would be executed from node2, via ssh for node1.
> > > What do you think about it?
> > 
> > Good plan, though perhaps there should be some allowance for the
> > case in which only node1 is running when the power dies.
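
For reference, here's a rough sketch of what that UPS hook could look
like as a single script run on node2 (the node names, the ssh
invocation, and the lack of error handling are all assumptions based on
your description; you'd also want to handle the case where only one
node is up when the power dies):

  #!/bin/sh
  # UPS-triggered shutdown, executed on node2

  # Put node1 in standby (crm -w waits for the cluster to finish the
  # resulting transitions, i.e. the resources moving to node2), then
  # stop the cluster stack there and power it off.
  crm -w node standby node1
  ssh node1 'systemctl stop pacemaker && systemctl poweroff'

  # Now do the same for the local node.
  crm -w node standby node2
  systemctl stop pacemaker    # also stops corosync
  systemctl poweroff
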
> > 
> > > Now the restart, which gives me trouble.
> > > Currently I want to restart the cluster manually, because I'm not
> > > completely familiar with pacemaker and a bit afraid of running
> > > into situations caused by automation that I didn't think of
> > > before.
> > > 
> > > I can do that from anywhere because both nodes have ILO-cards.
> > > 
> > > I start e.g. node1 with power button.
> > > 
> > > systemctl start corosync
> > > systemctl start pacemaker
> > >   corosync and pacemaker don't start automatically; I have read
> > > that recommendation several times.
> > > Now my first problem. Let's assume the other node is broken, but
> > > I still want to get resources running. My no-quorum-policy is
> > > ignore, which should be fine. But with this setup I don't get the
> > > resources running automatically.
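
As an aside, when starting a single node by hand like this, the
standard status tools are handy for seeing what is holding resources
back. A minimal sequence (no assumptions beyond a corosync 2 /
pacemaker stack); more on the quorum flags below:

  systemctl start corosync
  systemctl start pacemaker
  corosync-quorumtool -s    # quorum state and flags, e.g. 2Node, WaitForAll
  crm_mon -1                # one-shot cluster status, including stopped resources
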
> > 
> > I'm guessing you have corosync 2's wait_for_all set (probably
> > implicitly by two_node). This is a safeguard for the situation where
> > both nodes are booted up but can't see each other.
> > 
> > If you're sure the other node is down, you can disable wait_for_all
> > before starting the node. (I'm not sure if this can be changed while
> > corosync is already running.)
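
To make that concrete, these settings live in the quorum section of
corosync.conf; when two_node is set, wait_for_all is implied unless you
override it explicitly. A sketch only (and only set this if you're
certain the other node really is down):

  quorum {
      provider: corosync_votequorum
      two_node: 1
      # two_node implies wait_for_all: 1; setting it to 0 lets a
      # freshly started node become quorate without having seen
      # its peer first
      wait_for_all: 0
  }

As noted above, I'm not sure this takes effect while corosync is
running, so it may require a corosync restart.
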
> > 
> > > 
> > > crm_mon says:
> > > ======================================================================
> > > Stack: corosync
> > > Current DC: ha-idg-1 (version 1.1.19+20180928.0d2680780-1.8-1.1.19+20180928.0d2680780) - partition WITHOUT quorum
> > > Last updated: Tue Jan 22 15:34:19 2019
> > > Last change: Tue Jan 22 13:39:14 2019 by root via crm_attribute on ha-idg-1
> > > 
> > > 2 nodes configured
> > > 13 resources configured
> > > 
> > > Node ha-idg-1: online
> > > Node ha-idg-2: UNCLEAN (offline)
> > > 
> > > Inactive resources:
> > > 
> > > fence_ha-idg-2  (stonith:fence_ilo2):   Stopped
> > > fence_ha-idg-1  (stonith:fence_ilo4):   Stopped
> > >  Clone Set: cl_share [gr_share]
> > >      Stopped: [ ha-idg-1 ha-idg-2 ]
> > > vm_mausdb       (ocf::heartbeat:VirtualDomain): Stopped
> > > vm_sim  (ocf::heartbeat:VirtualDomain): Stopped
> > > vm_geneious     (ocf::heartbeat:VirtualDomain): Stopped
> > >  Clone Set: cl_SNMP [SNMP]
> > >      Stopped: [ ha-idg-1 ha-idg-2 ]
> > > 
> > > Node Attributes:
> > > * Node ha-idg-1:
> > >     + maintenance                       : off
> > > 
> > > Migration Summary:
> > > * Node ha-idg-1:
> > > 
> > > Failed Fencing Actions:
> > > * Off of ha-idg-2 failed: delegate=, client=crmd.9938, origin=ha-idg-1,
> > >     last-failed='Tue Jan 22 15:34:17 2019'
> > > 
> 
> This is another problem - if the cluster requires stonith, it won't
> start resources while the other node is UNCLEAN and the fencing
> attempt apparently failed.

Good point, I missed that. If you're sure the target node is down, you
can tell the cluster that with "stonith_admin --confirm <node>", and it
will treat it as successfully fenced.
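
For example, after verifying out of band (e.g. on the iLO console) that
ha-idg-2 really is powered off:

  stonith_admin --confirm ha-idg-2

Be careful with this: confirming fencing of a node that is in fact
still running defeats the whole point of fencing.
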

> 
> > > Negative Location Constraints:
> > >  loc_fence_ha-idg-1     prevents fence_ha-idg-1 from running on ha-idg-1
> > >  loc_fence_ha-idg-2     prevents fence_ha-idg-2 from running on ha-idg-2
> > > ======================================================================
> > > The cluster does not have quorum, but that shouldn't be a
> > > problem. corosync and pacemaker are started.
> > > Why don't the resources start automatically? All target-roles are
> > > set to "started".
> > > Is it because the fencing didn't succeed? Is the status of
> > > ha-idg-2 not clear to the cluster?
> > > If yes, what can I do?
> > > 
> > > Bernd
> > > 
> > 



