[Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)

Fri Jul 4 12:52:33 EDT 2014

> Date: Fri, 4 Jul 2014 23:17:07 +0900
> From: lists at alteeve.ca
> To: pacemaker at oss.clusterlabs.org
> Subject: Re: [Pacemaker] Creating a safe cluster-node shutdown script (for when UPS goes OnBattery+LowBattery)
> 
> On 04/07/14 02:16 PM, Giuseppe Ragusa wrote:
> > Hi all,
> > I'm trying to create a script as per subject (on CentOS 6.5,
> > CMAN+Pacemaker, only DRBD+KVM active/passive resources; SNMP-UPS
> > monitored by NUT).
> >
> > Ideally I think that each node should stop (disable) all locally-running
> > VirtualDomain resources (doing so cleanly demotes then downs the DRBD
> > resources underneath), then put itself in standby and finally shut down.
> >
> > On further startup, manual intervention would be required to unstandby
> > all nodes and enable resources (nodes already in standby and resources
> > already disabled before blackout should be manually distinguished).
> >
> > Is this strategy conceptually safe?
> >
> > Unfortunately, various searches have turned up no "prior art" :)
> 
> I started work on something similar with apcupsd (first I had to make it 
> work with multiple UPSes, which I did). Then I decided not to actually 
> implement, and decided instead to leave it up to an admin to decide 
> how/when/if to initiate a graceful shutdown.
> 
> My rationale was that this placed way too much potential damage in the 
> hands of, effectively, a single trigger. One bad bug and you could bring 
> down a perfectly fine cluster.

Perfectly reasonable; in fact, I was limiting my effort to a single, narrowly defined case.
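
For illustration only, here is a minimal Python sketch of that narrowly defined case and of the sequence proposed in the original message (disable the local VirtualDomain resources, put the node in standby, then halt). It assumes that the NUT ups.status value has already been read into a string of space-separated flags (OB = on battery, LB = low battery) and that the installed pcs provides "pcs resource disable" and "pcs cluster standby"; newer pcs releases use "pcs node standby" instead. The function names, node name and resource names are hypothetical. The sketch does not wait for each VM to finish stopping, which a real script would need to do.

```python
import subprocess


def is_critical(ups_status):
    """Return True only when the UPS reports both OB and LB flags."""
    flags = set(ups_status.split())
    return {"OB", "LB"} <= flags


def shutdown_plan(node, local_vm_resources):
    """Build the ordered command list: disable VMs, standby node, halt."""
    # Assumption: the caller already knows which VirtualDomain resources
    # are running on this node; discovering them is out of scope here.
    plan = [["pcs", "resource", "disable", name] for name in local_vm_resources]
    # Assumption: older pcs syntax; newer releases use "pcs node standby".
    plan.append(["pcs", "cluster", "standby", node])
    plan.append(["shutdown", "-h", "now"])
    return plan


def run_plan(plan, dry_run=True):
    """Print the commands by default; execute them only when dry_run=False."""
    for cmd in plan:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)


if __name__ == "__main__":
    # Hypothetical status string, node name and resource names.
    if is_critical("OB LB"):
        run_plan(shutdown_plan("node1", ["vm-web", "vm-db"]))
```

Keeping the plan as plain data and defaulting to a dry run makes it possible to review the exact commands before the single trigger is ever allowed to act on a live cluster.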

> Instead, what I did was ensure that any power event triggered an alert 
> email (x2, as both nodes ran the monitoring app). This way, I (and the 
> client's admins) would be notified immediately if anything happened. 
> Then it was up to us to decide how/if to initiate a graceful shutdown.

My client's business setup is peculiar too: too big to disregard HA
solutions, but too small to have staff/consultants on call for "secondary"
emergencies (like extended power outages during summer storms).
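
As a rough sketch of the alert-on-any-power-event approach quoted above, the following Python fragment builds a notification message that a UPS monitoring hook could send from each node. It is not Digimer's implementation: the function names, the addresses, the event label and the local SMTP relay on "localhost" are all assumptions made for illustration.

```python
import smtplib
from email.message import EmailMessage


def build_alert(node, event, recipients, sender="cluster-alerts@example.com"):
    """Build a plain-text alert message describing a UPS power event."""
    msg = EmailMessage()
    msg["Subject"] = "[UPS] {} on {}".format(event, node)
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        "Node {} reported UPS event {}.\n"
        "Decide manually whether to start a graceful shutdown.\n".format(node, event)
    )
    return msg


def send_alert(msg, smtp_host="localhost"):
    """Send the message through an SMTP relay (assumed to run locally)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    # Hypothetical node name, event label and recipient; prints instead of sending.
    print(build_alert("node1", "ONBATT", ["admin@example.com"]))
```

Since every node runs the same hook, each power event produces one message per node, which matches the "x2" behaviour described in the quoted text.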

> One real-world example:
> 
> A couple months ago, a client's neighborhood was hit with a prolonged 
> power outage. Eventually, we decided to gracefully shut down. However, 
> one of the Windows VMs had downloaded and prepped to install about 30 
> updates (no idea how this happened, except Windows). Anyway, the VM took 
> more time to shut down than the batteries could support. So half-way 
> through, we withdrew one node and powered it off to shed load and gain 
> battery runtime. This kind of logic cannot reasonably be coded into a 
> script.

Enlightening tale!

Thinking about it: I suppose that more VM-intensive needs (VDI etc.) would call for VM-specific
HA solutions (like oVirt/OpenStack), where VMs can be treated entirely like physical
machines (install UPS agents in the guest OS and let them handle the shutdown). On a "classic"
HA clustering solution, instead, I suppose that the VMs should be server VMs (or be treated as such),
and even Windows admins know multiple ways (interactive, GPO, registry) to keep
update installation under control (typically "interactive installation during a maintenance
window"). Leaving "install by default on shutdown" enabled does not speak well of those admins ;>

> My $0.02.
> 
> -- 
> Digimer

Many thanks for your suggestions and shared experiences!

Regards,
Giuseppe
