[ClusterLabs] shutdown and restart of complete cluster due to power outage with UPS
Ken Gaillot
kgaillot at redhat.com
Tue Jan 22 12:00:07 EST 2019
On Tue, 2019-01-22 at 16:52 +0100, Lentes, Bernd wrote:
> Hi,
>
> we have a new UPS with enough charge to keep our 2-node cluster and
> its periphery (SAN, switches ...) running for a reasonable time.
> I'm currently thinking about the shutdown and restart procedure for
> the complete cluster when the power is lost and does not come back
> soon. The cluster is then powered by the UPS, but not indefinitely,
> so I have to shut down the complete cluster.
> The UPS can trigger scripts on each node.
>
> My shutdown procedure is:
> crm -w node standby node1
> resources are migrated to node2
> systemctl stop pacemaker
> this also stops corosync
> node is not fenced ! (because of standby ?)
Clean shutdowns don't get fenced. As long as the exiting node can tell
the rest of the cluster that it's leaving, everything can be
coordinated gracefully.
> systemctl poweroff
> clean shutdown of node1
>
> crm -w node standby node2
> clean stop of resources
> systemctl stop pacemaker
> systemctl poweroff
>
> The scripts would be executed from node2, via ssh for node1.
> What do you think about it ?
Good plan, though perhaps there should be some allowance for the case
in which only node1 is running when the power dies.
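The whole procedure could be wrapped in a single UPS hook along those
lines. This is only a sketch: the run()/DRY_RUN machinery and the PEER
variable are illustrative assumptions, not part of the plan above, and
the hostname ha-idg-1 is taken from the crm_mon output later in the
thread. It keeps working when the peer is already dead by testing
reachability first.

```shell
#!/bin/sh
# Hypothetical UPS shutdown hook, run on the surviving node.
# DRY_RUN=1 (the default here) only prints the commands so the
# sequence can be reviewed; set DRY_RUN=0 to actually execute them.
PEER=${PEER:-ha-idg-1}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

ups_shutdown() {
    # Take the peer down first -- but only if it is actually reachable,
    # which covers the case where node1 is already dead when power fails.
    if run ssh "$PEER" true; then
        run ssh "$PEER" crm -w node standby "$PEER"   # -w waits for migration
        run ssh "$PEER" systemctl stop pacemaker      # also stops corosync
        run ssh "$PEER" systemctl poweroff
    fi
    # Then shut down the local node the same way.
    run crm -w node standby "$(hostname)"
    run systemctl stop pacemaker
    run systemctl poweroff
}

ups_shutdown
```

The ordering matters: standby first so resources stop or migrate
cleanly, then a pacemaker stop so the peer sees a graceful leave and
no fencing is triggered.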
> Now the restart, which gives me trouble.
> Currently I want to restart the cluster manually, because I'm not
> completely familiar with pacemaker and a bit afraid of automation
> producing situations I didn't think of before.
>
> I can do that from anywhere because both nodes have ILO-cards.
>
> I start e.g. node1 with power button.
>
> systemctl start corosync
> systemctl start pacemaker
> corosync and pacemaker don't start automatically at boot; I have
> read that recommendation several times.
> Now my first problem: let's assume the other node is broken, but I
> still want to get resources running. My no-quorum-policy is ignore,
> which should be fine. But with this setup the resources don't start
> automatically.
I'm guessing you have corosync 2's wait_for_all set (probably
implicitly by two_node). This is a safeguard for the situation where
both nodes are booted up but can't see each other.
If you're sure the other node is down, you can disable wait_for_all
before starting the node. (I'm not sure if this can be changed while
corosync is already running.)
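For reference, this is roughly what the relevant quorum section of
corosync.conf looks like on a two-node cluster (a sketch, not taken
from your actual configuration):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node implicitly enables wait_for_all; override it only
        # when you are certain the other node is really down:
        # wait_for_all: 0
    }

See the votequorum(5) man page for the exact semantics of two_node
and wait_for_all.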
>
> crm_mon says:
> =====================================================================
> Stack: corosync
> Current DC: ha-idg-1 (version 1.1.19+20180928.0d2680780-1.8-1.1.19+20180928.0d2680780) - partition WITHOUT quorum
> Last updated: Tue Jan 22 15:34:19 2019
> Last change: Tue Jan 22 13:39:14 2019 by root via crm_attribute on ha-idg-1
>
> 2 nodes configured
> 13 resources configured
>
> Node ha-idg-1: online
> Node ha-idg-2: UNCLEAN (offline)
>
> Inactive resources:
>
> fence_ha-idg-2 (stonith:fence_ilo2): Stopped
> fence_ha-idg-1 (stonith:fence_ilo4): Stopped
> Clone Set: cl_share [gr_share]
> Stopped: [ ha-idg-1 ha-idg-2 ]
> vm_mausdb (ocf::heartbeat:VirtualDomain): Stopped
> vm_sim (ocf::heartbeat:VirtualDomain): Stopped
> vm_geneious (ocf::heartbeat:VirtualDomain): Stopped
> Clone Set: cl_SNMP [SNMP]
> Stopped: [ ha-idg-1 ha-idg-2 ]
>
> Node Attributes:
> * Node ha-idg-1:
> + maintenance : off
>
> Migration Summary:
> * Node ha-idg-1:
>
> Failed Fencing Actions:
> * Off of ha-idg-2 failed: delegate=, client=crmd.9938, origin=ha-idg-1,
>   last-failed='Tue Jan 22 15:34:17 2019'
>
> Negative Location Constraints:
>  loc_fence_ha-idg-1 prevents fence_ha-idg-1 from running on ha-idg-1
>  loc_fence_ha-idg-2 prevents fence_ha-idg-2 from running on ha-idg-2
> =====================================================================
> The cluster does not have quorum, but that shouldn't be a problem.
> corosync and pacemaker are started.
> Why don't the resources start automatically ? All target-roles are
> set to "started".
> Is it because the fencing didn't succeed, so the status of ha-idg-2
> isn't clear to the cluster ?
> If yes, what can I do ?
>
> Bernd
>