[ClusterLabs] shutdown and restart of complete cluster due to power outage with UPS

Andrei Borzenkov arvidjaar at gmail.com
Tue Jan 22 12:35:39 EST 2019


22.01.2019 20:00, Ken Gaillot wrote:
> On Tue, 2019-01-22 at 16:52 +0100, Lentes, Bernd wrote:
>> Hi,
>>
>> we have a new UPS which has enough charge to supply our 2-node
>> cluster and its periphery (SAN, switches ...) for a reasonable time.
>> I'm currently thinking about the shutdown and restart procedure for
>> the complete cluster when the power is lost and does not come back
>> soon. The cluster is then powered by the UPS, but that does not work
>> indefinitely, so I have to shut down the complete cluster.
>> I can run scripts on each node which are triggered by the UPS.
>>
>> My shutdown procedure is:
>> crm -w node standby node1
>>   resources are migrated to node2
>> systemctl stop pacemaker
>>   also stops corosync
>>   the node is not fenced! (because of standby?)
> 
> Clean shutdowns don't get fenced. As long as the exiting node can tell
> the rest of the cluster that it's leaving, everything can be
> coordinated gracefully.
> 
>> systemctl poweroff
>>   clean shutdown of node1
>>
>> crm -w node standby node2
>>   clean stop of resources
>> systemctl stop pacemaker
>> systemctl poweroff
>>
>> The scripts would be executed from node2, via ssh for node1.
>> What do you think about it?
> 
> Good plan, though perhaps there should be some allowance for the case
> in which only node1 is running when the power dies.
> 
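
Something along those lines could cover that case, too - a rough,
untested sketch, assuming the UPS agent runs it as root on whichever
node is still up and that passwordless ssh between the nodes works
(node names are placeholders from your procedure):

#!/bin/sh
# UPS-triggered cluster shutdown - untested sketch, adjust to your setup.
# LOCAL is the node this runs on, PEER is the other node.
LOCAL=$(uname -n)
if [ "$LOCAL" = node1 ]; then PEER=node2; else PEER=node1; fi

# If the peer is still reachable, move its resources here and shut it
# down cleanly first.
if ping -c 1 -W 2 "$PEER" >/dev/null 2>&1; then
    ssh "$PEER" "crm -w node standby $PEER && systemctl stop pacemaker && systemctl poweroff"
fi

# Then stop the resources and power off the local node.
crm -w node standby "$LOCAL"
systemctl stop pacemaker    # also stops corosync
systemctl poweroff
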
>> Now the restart, which gives me trouble.
>> Currently I want to restart the cluster manually, because I'm not
>> completely familiar with Pacemaker and a bit afraid of ending up in
>> situations due to automation that I didn't think of before.
>>
>> I can do that from anywhere because both nodes have iLO cards.
>>
>> I start e.g. node1 with the power button.
>>
>> systemctl start corosync
>> systemctl start pacemaker
>>   corosync and pacemaker don't start automatically; I read that
>> recommendation several times.
>> Now my first problem: let's assume the other node is broken, but I
>> still want to get the resources running. My no-quorum-policy is
>> ignore, which should be fine. But with this setup the resources
>> don't start automatically.
> 
> I'm guessing you have corosync 2's wait_for_all set (probably
> implicitly by two_node). This is a safeguard for the situation where
> both nodes are booted up but can't see each other.
> 
> If you're sure the other node is down, you can disable wait_for_all
> before starting the node. (I'm not sure if this can be changed while
> corosync is already running.)
> 
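
For reference, the relevant quorum section of corosync.conf looks
roughly like this (example values, not taken from your configuration);
two_node: 1 implies wait_for_all: 1 unless it is overridden explicitly:

quorum {
        provider: corosync_votequorum
        two_node: 1
        # override the default implied by two_node only when you are
        # sure the other node is really down
        wait_for_all: 0
}

See votequorum(5) for the details and caveats.
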
>>
>> crm_mon says:
>> ========================================================================
>> Stack: corosync
>> Current DC: ha-idg-1 (version 1.1.19+20180928.0d2680780-1.8-1.1.19+20180928.0d2680780) - partition WITHOUT quorum
>> Last updated: Tue Jan 22 15:34:19 2019
>> Last change: Tue Jan 22 13:39:14 2019 by root via crm_attribute on ha-idg-1
>>
>> 2 nodes configured
>> 13 resources configured
>>
>> Node ha-idg-1: online
>> Node ha-idg-2: UNCLEAN (offline)
>>
>> Inactive resources:
>>
>> fence_ha-idg-2  (stonith:fence_ilo2):   Stopped
>> fence_ha-idg-1  (stonith:fence_ilo4):   Stopped
>>  Clone Set: cl_share [gr_share]
>>      Stopped: [ ha-idg-1 ha-idg-2 ]
>> vm_mausdb       (ocf::heartbeat:VirtualDomain): Stopped
>> vm_sim  (ocf::heartbeat:VirtualDomain): Stopped
>> vm_geneious     (ocf::heartbeat:VirtualDomain): Stopped
>>  Clone Set: cl_SNMP [SNMP]
>>      Stopped: [ ha-idg-1 ha-idg-2 ]
>>
>> Node Attributes:
>> * Node ha-idg-1:
>>     + maintenance                       : off
>>
>> Migration Summary:
>> * Node ha-idg-1:
>>
>> Failed Fencing Actions:
>> * Off of ha-idg-2 failed: delegate=, client=crmd.9938, origin=ha-idg-1,
>>     last-failed='Tue Jan 22 15:34:17 2019'
>>

This is another problem - if the cluster requires stonith, it won't
start resources while the other node is UNCLEAN and the fencing attempt
apparently failed.
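
If you have verified out of band (e.g. via the iLO) that ha-idg-2 really
is powered off, you can tell the cluster so manually, for example with

  stonith_admin --confirm=ha-idg-2

or, with crmsh,

  crm node clearstate ha-idg-2

Use this with care: confirming a node that is in fact still running
defeats the purpose of fencing.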

>> Negative Location Constraints:
>>  loc_fence_ha-idg-1     prevents fence_ha-idg-1 from running on ha-idg-1
>>  loc_fence_ha-idg-2     prevents fence_ha-idg-2 from running on ha-idg-2
>> =====================================================================
>> The cluster does not have quorum, but that shouldn't be a problem.
>> corosync and pacemaker are started.
>> Why don't the resources start automatically? All target-roles are
>> set to "started".
>> Is it because the fencing didn't succeed? Is the status of ha-idg-2
>> unclear to the cluster?
>> If yes, what can I do?
>>
>> Bernd
>>
> 



