<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 2022-02-01 11:16, Lentes, Bernd
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:2015563303.162228678.1643732160805.JavaMail.zimbra@helmholtz-muenchen.de">
<pre class="moz-quote-pre" wrap="">Hi,
we just experienced two power outages in a few days.
This showed me that our UPS configuration and the handling of resources on the cluster is insufficient.
We have a two-node cluster with SLES 12 SP5 and a Smart-UPS SRT 3000 from APC with Network Management Card.
The UPS is able to buffer the two nodes and some Hardware (SAN, Monitor) for about one hour.
Our resources are Virtual Domains, about 20 of different flavor and version.
Our primary goal is not to bypass as long as possible a power outage but to shutdown all domains correctly after a dedicated time.
I'm currently thinking of waiting for a dedicated time (maybe 15 minutes) and then do a "crm resource stop VirtualDomains" in a script.
I would give the cluster some time for the shutdown (5-10 minutes) and afterwards shutdown the nodes (via script).
I have to keep an eye on if both nodes are running or only one of them.
How is your approach ?
Bernd
</pre>
</blockquote>
<p>I don't know if this will be a useful answer for you, but I
haven't seen anyone else reply. <br>
</p>
<p>In the Anvil!, we use SNMP to collect data from the APC UPSes
powering a given cluster. The OIDs we read are listed at the top
of the file linked below, and the logic that reads and collects
the data starts here:</p>
<p><a class="moz-txt-link-freetext" href="https://github.com/ClusterLabs/anvil/blob/main/scancore-agents/scan-apc-ups/scan-apc-ups#L3026">https://github.com/ClusterLabs/anvil/blob/main/scancore-agents/scan-apc-ups/scan-apc-ups#L3026</a></p>
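<p>To give a rough idea of what polling a UPS over SNMP can look
like, here is a minimal sketch. It is not taken from the agent
above. It assumes the net-snmp <code>snmpget</code> tool is
installed, that SNMP v2c is enabled on the management card, and
that the three OIDs are the PowerNet-MIB ones as I remember them,
so please verify them against the MIB for your card. The function
names and the "public" community are made up for the example.</p>
<pre>
```python
import subprocess

# Assumed PowerNet-MIB OIDs; check them against the MIB for your card.
OIDS = {
    "output_status": ".1.3.6.1.4.1.318.1.1.1.4.1.1.0",    # upsBasicOutputStatus
    "battery_percent": ".1.3.6.1.4.1.318.1.1.1.2.2.1.0",  # upsAdvBatteryCapacity
    "runtime_ticks": ".1.3.6.1.4.1.318.1.1.1.2.2.3.0",    # upsAdvBatteryRunTimeRemaining
}
ON_BATTERY = 3  # assumed upsBasicOutputStatus value for "onBattery"


def snmpget_command(host, community, oid):
    """Build an snmpget call that prints only the raw numeric value."""
    # -Oq quick print, -Ov value only, -Ot raw timeticks, -Oe numeric enums
    return ["snmpget", "-v2c", "-c", community, "-Oqvte", host, oid]


def parse_reading(raw):
    """Turn the three raw snmpget outputs into a small reading dict."""
    ticks = int(raw["runtime_ticks"].strip())
    return {
        "on_battery": int(raw["output_status"].strip()) == ON_BATTERY,
        "battery_percent": int(raw["battery_percent"].strip()),
        "runtime_minutes": ticks // 6000,  # TimeTicks are hundredths of a second
    }


def poll_ups(host, community="public"):
    """Query one UPS; needs net-snmp and a reachable management card."""
    raw = {}
    for name, oid in OIDS.items():
        result = subprocess.run(
            snmpget_command(host, community, oid),
            capture_output=True, text=True, check=True, timeout=10,
        )
        raw[name] = result.stdout
    return parse_reading(raw)
```
</pre>
<p>Calling <code>poll_ups()</code> with the address of the
management card from cron or a small daemon would give you the
on-battery flag and remaining runtime that the rest of the logic
needs.</p>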
<p>Some processing happens in-agent, but mainly the collected data
is written to a generic "power" table (as we support any UPS we
can collect data from). When we're done scanning, we analyze the
data in the 'power' table to decide if we need to shed load
(withdraw and power off nodes to extend runtime), do a complete
graceful shutdown (if the batteries are about to die), or reboot
the nodes after power is restored.</p>
<p>This logic is handled mainly at the link below. First, we figure
out which UPS powers which nodes/clusters, then we pull the data
on those specific UPSes to return a general "power state".</p>
<p><a class="moz-txt-link-freetext" href="https://github.com/ClusterLabs/anvil/blob/main/Anvil/Tools/ScanCore.pm#L607">https://github.com/ClusterLabs/anvil/blob/main/Anvil/Tools/ScanCore.pm#L607</a></p>
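<p>As a simplified illustration of that idea (not the actual
ScanCore code), the sketch below maps nodes to the UPSes feeding
them and reduces the readings to one state per node. The
<code>feeds</code> and <code>readings</code> structures and the
function name are hypothetical, and it assumes redundant power
supplies, so a node is fine as long as any one of its UPSes has
mains power.</p>
<pre>
```python
def node_power_state(node, feeds, readings):
    """Summarise the UPSes feeding one node.

    feeds:    {node_name: [ups_name, ...]}  (hypothetical inventory)
    readings: {ups_name: {"on_battery": bool, "runtime_minutes": int}}
    """
    ups_list = [readings[name] for name in feeds[node]]
    on_mains = [ups for ups in ups_list if not ups["on_battery"]]
    if on_mains:
        # At least one feed still has input power, so nothing to do.
        return {"state": "ok", "runtime_minutes": None}
    # Every feed is on battery. Assuming redundant power supplies, the
    # UPS with the most runtime left decides how long the node can stay up.
    best = max(ups["runtime_minutes"] for ups in ups_list)
    return {"state": "on_battery", "runtime_minutes": best}
```
</pre>
<p>With a single UPS feeding both nodes, as in your setup, each
node's list simply holds that one UPS and both nodes get the same
state.</p>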
<p>The power state then tells the main daemon what actions to take,
if any (load shed, shut down, restart). That's here:</p>
<p><a class="moz-txt-link-freetext" href="https://github.com/ClusterLabs/anvil/blob/main/Anvil/Tools/ScanCore.pm#L1541">https://github.com/ClusterLabs/anvil/blob/main/Anvil/Tools/ScanCore.pm#L1541</a></p>
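<p>Boiled down a lot, the decision can be sketched as below. This
is only an illustration under assumptions, not what ScanCore
literally does: the state is either "ok" or "on_battery", the
action names are made up, and the thresholds (shed load after 15
minutes on battery, shut everything down at 10 minutes of runtime
left) are placeholders you would tune for your own hardware.</p>
<pre>
```python
def choose_action(state, runtime_minutes, nodes_up, minutes_on_battery,
                  shed_after=15, shutdown_below=10):
    """Pick one action from a summarised power state.

    All thresholds and action names are illustrative placeholders.
    """
    if state == "ok":
        # Power is back: restart the nodes if we had shut them all down.
        return "boot_nodes" if nodes_up == 0 else "none"
    if shutdown_below >= runtime_minutes:
        # Batteries are nearly drained: stop resources and power off cleanly.
        return "shutdown_all"
    if minutes_on_battery >= shed_after and nodes_up > 1:
        # Long outage: withdraw and power off one node to extend runtime.
        return "shed_load"
    return "none"
```
</pre>
<p>In your case, "shutdown_all" would be where a script runs
<code>crm resource stop</code> on the VirtualDomain resources,
waits for them to stop, and then powers off the nodes.</p>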
<p>This is super high level, and many of the specifics are tied to
the Anvil! cluster, but it hopefully gives you a starting point
on how to approach the problem. We've been doing it this way for
many years with really good results.</p>
<p>Cheers<br>
</p>
<pre class="moz-signature" cols="72">--
Digimer
Papers and Projects: <a class="moz-txt-link-freetext" href="https://alteeve.com/w/">https://alteeve.com/w/</a>
"I am, somehow, less interested in the weight and convolutions of Einstein’s brain than in the near certainty that people of equal talent have lived and died in cotton fields and sweatshops." - Stephen Jay Gould</pre>
</body>
</html>