[ClusterLabs] shutdown and restart of complete cluster due to power outage with UPS

Lentes, Bernd bernd.lentes at helmholtz-muenchen.de
Tue Jan 22 10:52:54 EST 2019


We have a new UPS which has enough charge to power our 2-node cluster and its periphery (SAN, switches ...) for a reasonable time.
I'm currently thinking about the shutdown and restart procedure for the complete cluster when power is lost and does not come back soon.
The cluster then runs on the UPS, but that does not last indefinitely, so I have to shut down the complete cluster.
I can run scripts on each node which are triggered by the UPS.

My shutdown procedure is:
crm -w node standby node1
  resources are migrated to node2
systemctl stop pacemaker
  also stops corosync
  node is not fenced! (because of standby?)
systemctl poweroff
  clean shutdown of node1

crm -w node standby node2
  clean stop of resources
systemctl stop pacemaker
systemctl poweroff

The scripts would be executed from node2, reaching node1 via ssh.
What do you think about it?
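
To make the sequence concrete, here is a minimal sketch of how such a UPS-triggered script on node2 could look. It is only a sketch under assumptions: the function names (run, shutdown_remote, shutdown_local, shutdown_cluster) and the DRY_RUN switch are hypothetical, and passwordless ssh from node2 to node1 is assumed. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the shutdown sequence, meant to run on node2.
# DRY_RUN=1 (default here) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# shutdown_remote NODE: put NODE in standby (resources migrate away),
# then stop pacemaker and power off NODE via ssh (assumes key-based login).
shutdown_remote() {
    node="$1"
    run crm -w node standby "$node" || return 1
    run ssh "$node" systemctl stop pacemaker || return 1
    run ssh "$node" systemctl poweroff
}

# shutdown_local NODE: same steps for the node the script runs on;
# with the other node already gone, standby stops the resources cleanly.
shutdown_local() {
    node="$1"
    run crm -w node standby "$node" || return 1
    run systemctl stop pacemaker || return 1
    run systemctl poweroff
}

# shutdown_cluster REMOTE LOCAL: remote node first, then the local one.
shutdown_cluster() {
    shutdown_remote "$1" && shutdown_local "$2"
}
```

Called as "shutdown_cluster node1 node2" it prints the six commands in the order listed above; a real run on node2 would set DRY_RUN=0 first.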

Now the restart, which is giving me trouble.
For now I want to restart the cluster manually, because I'm not completely familiar with pacemaker and a bit afraid of ending up in
constellations caused by automation that I didn't think of beforehand.
I can do that from anywhere because both nodes have ILO cards.

I start e.g. node1 with the power button.

systemctl start corosync
systemctl start pacemaker
  corosync and pacemaker are not started automatically at boot; I read that recommendation several times.
Now my first problem. Let's assume the other node is broken, but I still want to get the
resources running. My no-quorum-policy is ignore, so that should be fine. But I have exactly this setup now and the resources don't start automatically.
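
The manual start on a single node can be sketched as a small script. This is only an illustration under assumptions: the names run, start_cluster_stack and DRY_RUN are hypothetical, and by default the commands are only printed. Besides starting the two services it queries the no-quorum-policy property and prints a one-shot status including inactive resources.

```shell
#!/bin/sh
# Sketch of the manual start of the cluster stack on one node.
# DRY_RUN=1 (default here) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# start_cluster_stack: corosync first, then pacemaker, then two read-only checks.
start_cluster_stack() {
    run systemctl start corosync || return 1
    run systemctl start pacemaker || return 1
    # In a real run pacemaker may need a few seconds before these queries answer.
    run crm_attribute --type crm_config --name no-quorum-policy --query
    run crm_mon -1 -r
}
```

Calling start_cluster_stack with the default DRY_RUN=1 just lists the four commands, which is handy for checking the order before running it for real.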

crm_mon says:
Stack: corosync
Current DC: ha-idg-1 (version 1.1.19+20180928.0d2680780-1.8-1.1.19+20180928.0d2680780) - partition WITHOUT quorum
Last updated: Tue Jan 22 15:34:19 2019
Last change: Tue Jan 22 13:39:14 2019 by root via crm_attribute on ha-idg-1

2 nodes configured
13 resources configured

Node ha-idg-1: online
Node ha-idg-2: UNCLEAN (offline)

Inactive resources:

fence_ha-idg-2  (stonith:fence_ilo2):   Stopped
fence_ha-idg-1  (stonith:fence_ilo4):   Stopped
 Clone Set: cl_share [gr_share]
     Stopped: [ ha-idg-1 ha-idg-2 ]
vm_mausdb       (ocf::heartbeat:VirtualDomain): Stopped
vm_sim  (ocf::heartbeat:VirtualDomain): Stopped
vm_geneious     (ocf::heartbeat:VirtualDomain): Stopped
 Clone Set: cl_SNMP [SNMP]
     Stopped: [ ha-idg-1 ha-idg-2 ]

Node Attributes:
* Node ha-idg-1:
    + maintenance                       : off

Migration Summary:
* Node ha-idg-1:

Failed Fencing Actions:
* Off of ha-idg-2 failed: delegate=, client=crmd.9938, origin=ha-idg-1,
    last-failed='Tue Jan 22 15:34:17 2019'

Negative Location Constraints:
 loc_fence_ha-idg-1     prevents fence_ha-idg-1 from running on ha-idg-1
 loc_fence_ha-idg-2     prevents fence_ha-idg-2 from running on ha-idg-2
The cluster does not have quorum, but that shouldn't be a problem. corosync and pacemaker are started.
Why don't the resources start automatically? All target-roles are set to "started".
Is it because the fencing didn't succeed, so the status of ha-idg-2 isn't clear to the cluster?
If yes, what can I do?
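
To recognise this state from a script, the crm_mon text could be scanned for UNCLEAN nodes and failed fencing actions. The following is only a sketch: the function names unclean_nodes and failed_fencing_targets are hypothetical, and the patterns assume the output layout shown above (pacemaker 1.1.19); other versions may format these sections differently.

```shell
#!/bin/sh
# Both functions read crm_mon text on stdin (e.g. from "crm_mon -1 -r").

# unclean_nodes: print the names of nodes reported as UNCLEAN,
# based on lines like "Node ha-idg-2: UNCLEAN (offline)".
unclean_nodes() {
    awk '$1 == "Node" && /UNCLEAN/ { sub(/:$/, "", $2); print $2 }'
}

# failed_fencing_targets: print the target node of each entry in the
# "Failed Fencing Actions:" section; the section ends at the next blank line.
failed_fencing_targets() {
    awk '/^Failed Fencing Actions:/ { in_section = 1; next }
         in_section && /^\* /       { print $4 }
         in_section && /^$/         { in_section = 0 }'
}
```

Fed with the output above, for example via "crm_mon -1 -r | unclean_nodes", both functions would name ha-idg-2.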



