[ClusterLabs] shutdown and restart of complete cluster due to power outage with UPS
Ken Gaillot
kgaillot at redhat.com
Tue Jan 22 12:00:07 EST 2019
On Tue, 2019-01-22 at 16:52 +0100, Lentes, Bernd wrote:
> Hi,
>
> we have a new UPS with enough charge to keep our 2-node cluster and
> its periphery (SAN, switches ...) running for a reasonable time.
> I'm currently thinking about the shutdown and restart procedure for
> the complete cluster when the power is lost and does not come back
> soon. The cluster is then powered by the UPS, but not indefinitely,
> so I have to shut down the complete cluster.
> The UPS can trigger scripts on each node.
>
> My shutdown procedure is:
> crm -w node standby node1
> resources are migrated to node2
> systemctl stop pacemaker
> this also stops corosync
> node is not fenced ! (because of standby ?)
Clean shutdowns don't get fenced. As long as the exiting node can tell
the rest of the cluster that it's leaving, everything can be
coordinated gracefully.
> systemctl poweroff
> clean shutdown of node1
>
> crm -w node standby node2
> clean stop of resources
> systemctl stop pacemaker
> systemctl poweroff
>
> The scripts would be executed from node2, via ssh for node1.
> What do you think about it ?
Good plan, though perhaps there should be some allowance for the case
in which only node1 is running when the power dies.
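The whole procedure could be wrapped in a single UPS hook along those
lines. This is only a sketch: the run()/DRY_RUN machinery and the PEER
variable are illustrative assumptions, not part of the plan above, and
the hostname ha-idg-1 is taken from the crm_mon output later in the
thread. It keeps working when the peer is already dead by testing
reachability first.

```shell
#!/bin/sh
# Hypothetical UPS shutdown hook, run on the surviving node.
# DRY_RUN=1 (the default here) only prints the commands so the
# sequence can be reviewed; set DRY_RUN=0 to actually execute them.
PEER=${PEER:-ha-idg-1}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

ups_shutdown() {
    # Take the peer down first -- but only if it is actually reachable,
    # which covers the case where node1 is already dead when power fails.
    if run ssh "$PEER" true; then
        run ssh "$PEER" crm -w node standby "$PEER"   # -w waits for migration
        run ssh "$PEER" systemctl stop pacemaker      # also stops corosync
        run ssh "$PEER" systemctl poweroff
    fi
    # Then shut down the local node the same way.
    run crm -w node standby "$(hostname)"
    run systemctl stop pacemaker
    run systemctl poweroff
}

ups_shutdown
```

The ordering matters: standby first so resources stop or migrate
cleanly, then a pacemaker stop so the peer sees a graceful leave and
no fencing is triggered.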
> Now the restart, which gives me trouble.
> Currently I want to restart the cluster manually, because I'm not
> completely familiar with pacemaker and a bit afraid of automation
> producing situations I didn't think of before.
>
> I can do that from anywhere because both nodes have ILO-cards.
>
> I start e.g. node1 with power button.
>
> systemctl start corosync
> systemctl start pacemaker
> corosync and pacemaker don't start automatically at boot; I have
> read that recommendation several times.
> Now my first problem: let's assume the other node is broken, but I
> still want to get resources running. My no-quorum-policy is ignore,
> which should be fine. But with this setup the resources don't start
> automatically.
I'm guessing you have corosync 2's wait_for_all set (probably
implicitly by two_node). This is a safeguard for the situation where
both nodes are booted up but can't see each other.
If you're sure the other node is down, you can disable wait_for_all
before starting the node. (I'm not sure if this can be changed while
corosync is already running.)
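For reference, this is roughly what the relevant quorum section of
corosync.conf looks like on a two-node cluster (a sketch, not taken
from your actual configuration):

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node implicitly enables wait_for_all; override it only
        # when you are certain the other node is really down:
        # wait_for_all: 0
    }

See the votequorum(5) man page for the exact semantics of two_node
and wait_for_all.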
>
> crm_mon says:
> =====================================================================
> Stack: corosync
> Current DC: ha-idg-1 (version 1.1.19+20180928.0d2680780-1.8-1.1.19+20180928.0d2680780) - partition WITHOUT quorum
> Last updated: Tue Jan 22 15:34:19 2019
> Last change: Tue Jan 22 13:39:14 2019 by root via crm_attribute on ha-idg-1
>
> 2 nodes configured
> 13 resources configured
>
> Node ha-idg-1: online
> Node ha-idg-2: UNCLEAN (offline)
>
> Inactive resources:
>
> fence_ha-idg-2 (stonith:fence_ilo2): Stopped
> fence_ha-idg-1 (stonith:fence_ilo4): Stopped
> Clone Set: cl_share [gr_share]
> Stopped: [ ha-idg-1 ha-idg-2 ]
> vm_mausdb (ocf::heartbeat:VirtualDomain): Stopped
> vm_sim (ocf::heartbeat:VirtualDomain): Stopped
> vm_geneious (ocf::heartbeat:VirtualDomain): Stopped
> Clone Set: cl_SNMP [SNMP]
> Stopped: [ ha-idg-1 ha-idg-2 ]
>
> Node Attributes:
> * Node ha-idg-1:
> + maintenance : off
>
> Migration Summary:
> * Node ha-idg-1:
>
> Failed Fencing Actions:
> * Off of ha-idg-2 failed: delegate=, client=crmd.9938, origin=ha-idg-1,
>   last-failed='Tue Jan 22 15:34:17 2019'
>
> Negative Location Constraints:
>  loc_fence_ha-idg-1 prevents fence_ha-idg-1 from running on ha-idg-1
>  loc_fence_ha-idg-2 prevents fence_ha-idg-2 from running on ha-idg-2
> =====================================================================
> The cluster does not have quorum, but that shouldn't be a problem.
> corosync and pacemaker are started.
> Why don't the resources start automatically ? All target-roles are
> set to "started".
> Is it because the fencing didn't succeed, so the status of ha-idg-2
> isn't clear to the cluster ?
> If yes, what can I do ?
>
> Bernd
>