[ClusterLabs] shutdown and restart of complete cluster due to power outage with UPS
Andrei Borzenkov
arvidjaar at gmail.com
Tue Jan 22 12:35:39 EST 2019
On 22.01.2019 20:00, Ken Gaillot wrote:
> On Tue, 2019-01-22 at 16:52 +0100, Lentes, Bernd wrote:
>> Hi,
>>
>> we have a new UPS which has enough charge to power our 2-node
>> cluster and its periphery (SAN, switches ...) for a reasonable time.
>> I'm currently thinking about the shutdown and restart procedure for
>> the complete cluster when power is lost and does not come back soon.
>> The cluster is then powered by the UPS, but that does not last
>> indefinitely, so I have to shut down the complete cluster.
>> I have the possibility to run scripts on each node which are
>> triggered by the UPS.
>>
>> My shutdown procedure is:
>> crm -w node standby node1
>> (resources are migrated to node2)
>> systemctl stop pacemaker
>> (also stops corosync)
>> node1 is not fenced! (because of standby?)
>
> Clean shutdowns don't get fenced. As long as the exiting node can tell
> the rest of the cluster that it's leaving, everything can be
> coordinated gracefully.
>
>> systemctl poweroff
>> (clean shutdown of node1)
>>
>> crm -w node standby node2
>> (clean stop of resources)
>> systemctl stop pacemaker
>> systemctl poweroff
>>
>> The scripts would be executed from node2, via ssh for node1.
>> What do you think about it?
>
> Good plan, though perhaps there should be some allowance for the case
> in which only node1 is running when the power dies.
>
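A sketch of such a script, run from node2, covering that case (a
minimal example assuming crmsh and passwordless root ssh between the
nodes; the node names and filename are placeholders, adapt them to your
environment):

#!/bin/sh
# ups-shutdown.sh - run on node2, triggered by the UPS on low battery
PEER=node1

# Allow for the case where only this node is still running:
# only touch the peer if it is actually reachable.
if ssh -o ConnectTimeout=5 "$PEER" true 2>/dev/null; then
    # Move resources off the peer, stop its cluster stack, power off.
    ssh "$PEER" "crm -w node standby $PEER && systemctl stop pacemaker && systemctl poweroff"
fi

# Then take down this node the same way.
crm -w node standby node2
systemctl stop pacemaker
systemctl poweroff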
>> Now the restart, which gives me trouble.
>> Currently I want to restart the cluster manually, because I'm not
>> completely familiar with pacemaker and a bit afraid that automation
>> might produce situations I didn't think of before.
>>
>> I can do that from anywhere because both nodes have iLO cards.
>>
>> I start e.g. node1 with the power button.
>>
>> systemctl start corosync
>> systemctl start pacemaker
>> corosync and pacemaker don't start automatically; I have read that
>> recommendation several times.
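(That recommendation usually amounts to simply

systemctl disable corosync pacemaker

so that the stack only comes up when you start it by hand, as above.)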
>> Now my first problem: let's assume the other node is broken, but I
>> still want to get resources running. My no-quorum-policy is ignore,
>> which should be fine. But with this setup the resources don't start
>> automatically.
>
> I'm guessing you have corosync 2's wait_for_all set (probably
> implicitly by two_node). This is a safeguard for the situation where
> both nodes are booted up but can't see each other.
>
> If you're sure the other node is down, you can disable wait_for_all
> before starting the node. (I'm not sure if this can be changed while
> corosync is already running.)
>
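For reference, these options live in the quorum section of
/etc/corosync/corosync.conf. A two-node setup typically looks roughly
like this (see votequorum(5); two_node: 1 implies wait_for_all: 1
unless you override it explicitly):

quorum {
    provider: corosync_votequorum
    two_node: 1
    # Letting a lone node start resources means giving up the
    # protection against both nodes booting without seeing each other.
    wait_for_all: 0
}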
>>
>> crm_mon says:
>> =====================================================================
>> Stack: corosync
>> Current DC: ha-idg-1 (version 1.1.19+20180928.0d2680780-1.8-1.1.19+20180928.0d2680780) - partition WITHOUT quorum
>> Last updated: Tue Jan 22 15:34:19 2019
>> Last change: Tue Jan 22 13:39:14 2019 by root via crm_attribute on ha-idg-1
>>
>> 2 nodes configured
>> 13 resources configured
>>
>> Node ha-idg-1: online
>> Node ha-idg-2: UNCLEAN (offline)
>>
>> Inactive resources:
>>
>> fence_ha-idg-2 (stonith:fence_ilo2): Stopped
>> fence_ha-idg-1 (stonith:fence_ilo4): Stopped
>> Clone Set: cl_share [gr_share]
>> Stopped: [ ha-idg-1 ha-idg-2 ]
>> vm_mausdb (ocf::heartbeat:VirtualDomain): Stopped
>> vm_sim (ocf::heartbeat:VirtualDomain): Stopped
>> vm_geneious (ocf::heartbeat:VirtualDomain): Stopped
>> Clone Set: cl_SNMP [SNMP]
>> Stopped: [ ha-idg-1 ha-idg-2 ]
>>
>> Node Attributes:
>> * Node ha-idg-1:
>> + maintenance : off
>>
>> Migration Summary:
>> * Node ha-idg-1:
>>
>> Failed Fencing Actions:
>> * Off of ha-idg-2 failed: delegate=, client=crmd.9938,
>> origin=ha-idg-1, last-failed='Tue Jan 22 15:34:17 2019'
>>
This is another problem: if the cluster requires STONITH, it won't
start resources while the other node is UNCLEAN and the fencing attempt
has apparently failed.
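If you have verified out-of-band (e.g. via the iLO) that ha-idg-2 is
really powered off, you can tell the cluster so by manually confirming
the fencing, for example:

stonith_admin --confirm ha-idg-2

Be careful: confirming a node that is in fact still running defeats
fencing and can corrupt shared data.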
>> Negative Location Constraints:
>> loc_fence_ha-idg-1 prevents fence_ha-idg-1 from running on ha-idg-1
>> loc_fence_ha-idg-2 prevents fence_ha-idg-2 from running on ha-idg-2
>> =====================================================================
>> The cluster does not have quorum, but that shouldn't be a problem.
>> corosync and pacemaker are started.
>> Why don't the resources start automatically? All target-roles are
>> set to "started".
>> Is it because the fencing didn't succeed? Is the status of ha-idg-2
>> unclear to the cluster?
>> If yes, what can I do?
>>
>> Bernd
>>
>