<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 7/11/19 6:52 AM, Users wrote:<br>
</div>
<blockquote type="cite" cite="mid:CAA91j0XLHTUG44ew7CN=ebtaOa2H5bQDYoNgDSe-yiuHWbF23Q@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">On Thu, Jul 11, 2019 at 12:58 PM Lars Ellenberg
<a class="moz-txt-link-rfc2396E" href="mailto:lars.ellenberg@linbit.com"><lars.ellenberg@linbit.com></a> wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">
On Wed, Jul 10, 2019 at 06:15:56PM +0000, Michael Powell wrote:
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Thanks to you and Andrei for your responses. In our particular
situation, we want to be able to operate with either node in
stand-alone mode, or with both nodes protected by HA. I did not
mention this, but I am working on upgrading our product
from a version which used Pacemaker version 1.0.13 and Heartbeat
to run under CentOS 7.6 (later 8.0).
The older version did not exhibit this behavior, hence my concern.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Heartbeat by default has much less aggressive timeout settings,
and clearly distinguishes between "deadtime", and "initdead",
basically a "wait_for_all" with timeout: how long to wait for other
nodes during startup before declaring them dead and proceeding in
the startup sequence, ultimately fencing unseen nodes anyways.
Pacemaker itself has "dc-deadtime", documented as
"How long to wait for a response from other nodes during startup.",
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
Documentation is incomplete, it is the timeout to start DC (re-)election,
so it also applies to current DC failure and will delay recovery.
At least that is how I understand it :)
</pre>
</blockquote>
<p>Along these same lines, a drawback to extending dc-deadtime is that the cluster always waits for dc-deadtime to expire before starting resources, even if all nodes have already joined. So with a long dc-deadtime, every cluster startup is delayed by at least that long.<br>
</p>
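<p>For what it's worth, if you do decide to experiment with a longer dc-deadtime, something like the following should set it. This is only a sketch, assuming pcs is your management tool; crm_attribute can do the same thing:</p>
<pre style="white-space: pre-wrap;">
# Example only: raise dc-deadtime to two minutes. Every cluster start will
# then wait at least this long before resources are started.
pcs property set dc-deadtime=2min

# Verify the new value:
pcs property list | grep dc-deadtime
</pre>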
<p>I mentioned this in a previous post, but we dealt with this by synchronizing the startup of Corosync and Pacemaker with a simple systemd ExecStartPre script:
</p>
<pre style="white-space: pre-wrap; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration: none;">
# cat /etc/systemd/system/corosync.service.d/ha_wait.conf
[Service]
ExecStartPre=/sbin/ha_wait.sh
TimeoutStartSec=11min
where ha_wait.sh has something like:
#!/bin/bash
timeout=600
peer=<hostname of HA peer>
echo "Waiting for ${peer}"
peerup() {
systemctl -H ${peer} is-active --quiet corosync.service 2> /dev/null && return 0
return 1
}
start=${SECONDS}
while ! peerup && [ $((SECONDS-start)) -lt ${timeout} ]; do
echo -n .
sleep 5
done
peerup && echo "${peer} is up, starting HA" || echo "${peer} not up after ${timeout} starting HA alone"
</pre>
This causes Corosync startup to block until the partner node begins starting Corosync, so both nodes then start Corosync/Pacemaker at nearly the same time. If one node never comes up, the other node waits 10 minutes before starting on its own, after which the absent node will be fenced (startup fencing, and the resource startup that follows it, will only occur if no-quorum-policy is set to ignore).<br>
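<p>For reference, a rough sketch of how those properties could be inspected or set with pcs (again assuming pcs is your management tool; whether ignore is appropriate depends on having working fencing, as the quoted reply below points out):</p>
<pre style="white-space: pre-wrap;">
# Example only -- check the current values:
pcs property list --all | grep -E 'no-quorum-policy|startup-fencing'

# Let the cluster proceed without quorum (needed for the lone-survivor case
# above); only safe when fencing is configured and working:
pcs property set no-quorum-policy=ignore
</pre>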
<br>
Thanks,<br>
Chris<br>
<blockquote type="cite" cite="mid:CAA91j0XLHTUG44ew7CN=ebtaOa2H5bQDYoNgDSe-yiuHWbF23Q@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">
</pre>
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">but the 20s default of that in current Pacemaker is much likely
shorter than what you had as initdead in your "old" setup.
So maybe if you set dc-deadtime to two minutes or something,
that would give you the "expected" behavior?
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
If you call two isolated single-node clusters running the same
applications, likely using the same shared resources, "expected", just
set startup-fencing=false, but then do not complain about data
corruption.
</pre>
</blockquote>
</body>
</html>