<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

</head>

<body text="#000000" bgcolor="#FFFFFF">

<div class="moz-cite-prefix">On 7/11/19 6:52 AM, Users wrote:<br>

</div>

<blockquote type="cite" cite="mid:CAA91j0XLHTUG44ew7CN=ebtaOa2H5bQDYoNgDSe-yiuHWbF23Q@mail.gmail.com">

<pre class="moz-quote-pre" wrap="">On Thu, Jul 11, 2019 at 12:58 PM Lars Ellenberg

<a class="moz-txt-link-rfc2396E" href="mailto:lars.ellenberg@linbit.com">&lt;lars.ellenberg@linbit.com&gt;</a> wrote:

</pre>

<blockquote type="cite">

<pre class="moz-quote-pre" wrap="">

On Wed, Jul 10, 2019 at 06:15:56PM +0000, Michael Powell wrote:

</pre>

<blockquote type="cite">

<pre class="moz-quote-pre" wrap="">Thanks to you and Andrei for your responses.  In our particular

situation, we want to be able to operate with either node in

stand-alone mode, or with both nodes protected by HA.  I did not

mention this, but I am working on upgrading our product

from a version which used Pacemaker version 1.0.13 and Heartbeat

to run under CentOS 7.6 (later 8.0).

The older version did not exhibit this behavior, hence my concern.

</pre>

</blockquote>

<pre class="moz-quote-pre" wrap="">

Heartbeat by default has much less aggressive timeout settings,

and clearly distinguishes between "deadtime", and "initdead",

basically a "wait_for_all" with timeout: how long to wait for other

nodes during startup before declaring them dead and proceeding in

the startup sequence, ultimately fencing unseen nodes anyways.

Pacemaker itself has "dc-deadtime", documented as

"How long to wait for a response from other nodes during startup.",

</pre>

</blockquote>

<pre class="moz-quote-pre" wrap="">

The documentation is incomplete: it is the timeout for starting DC (re-)election,

so it also applies to current DC failure and will delay recovery.

At least that is how I understand it :)

</pre>

</blockquote>

<p>Along these same lines, a drawback of extending dc-deadtime is that the cluster always waits for dc-deadtime to expire before starting resources, even if all nodes have already joined. A long dc-deadtime therefore delays every startup by at least that long.<br>

</p>
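<p>For reference, here is a minimal sketch of how dc-deadtime could be raised with crm_attribute. The helper name set_dc_deadtime and the 2min value are only illustrative assumptions, and the function does nothing if the Pacemaker command-line tools are not installed:</p>

```shell
#!/bin/bash
# Sketch: set the dc-deadtime cluster property.
# set_dc_deadtime is a hypothetical helper; 2min is an example value only.
set_dc_deadtime() {
  local value=${1:-2min}
  # Bail out if the Pacemaker CLI tools are not available on this host.
  if ! command -v crm_attribute > /dev/null 2> /dev/null; then
    echo "crm_attribute not found; is Pacemaker installed?" >&2
    return 127
  fi
  crm_attribute --type crm_config --name dc-deadtime --update "${value}"
}
```

<p>Calling set_dc_deadtime 2min on one cluster node should be enough, since cluster properties are stored in the shared CIB. Keep in mind that the delay described above then applies to every startup.</p>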

<p>I mentioned this in a previous post, but we dealt with this by synchronizing the startup of Corosync and Pacemaker on both nodes with a simple systemd ExecStartPre script:

</p>

<pre style="white-space: pre-wrap; caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration: none;">

# cat /etc/systemd/system/corosync.service.d/ha_wait.conf

[Service]

ExecStartPre=/sbin/ha_wait.sh

TimeoutStartSec=11min

where ha_wait.sh has something like:

#!/bin/bash

timeout=600

peer=&lt;hostname of HA peer&gt;

echo "Waiting for ${peer}"

peerup() {

  systemctl -H ${peer} is-active --quiet corosync.service 2> /dev/null && return 0

  return 1

}

start=${SECONDS}

while ! peerup && [ $((SECONDS-start)) -lt ${timeout} ]; do

  echo -n .

  sleep 5

done

peerup && echo "${peer} is up, starting HA" || echo "${peer} not up after ${timeout}s, starting HA alone"

</pre>

This causes Corosync startup to block until the partner node begins starting Corosync. Once it does, both nodes start Corosync/Pacemaker at nearly the same time. If one node never comes up, the partner waits 10 minutes before starting, after which the absent node is fenced (startup fencing and the subsequent resource startup will only occur if no-quorum-policy is set to ignore).<br>
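<p>The wait loop in ha_wait.sh can also be factored into a small reusable function. This is only a sketch of the same poll-until-timeout pattern; the name wait_for and its three arguments are assumptions for illustration, not part of the script above:</p>

```shell
#!/bin/bash
# Sketch: poll a check command until it succeeds or the timeout elapses.
# Usage: wait_for CHECK_CMD TIMEOUT_SECONDS INTERVAL_SECONDS
# wait_for is a hypothetical helper, not part of the original ha_wait.sh.
wait_for() {
  local check=$1 timeout=$2 interval=$3
  local start=${SECONDS}
  # Re-run the check until it returns success.
  until "${check}"; do
    # Give up once the timeout has elapsed.
    if [ $((SECONDS - start)) -ge "${timeout}" ]; then
      return 1
    fi
    sleep "${interval}"
  done
  return 0
}
```

<p>With the peerup function from the script above, the loop would become wait_for peerup 600 5. Whatever wraps it should still exit 0 on timeout, because a failing ExecStartPre command prevents the unit from starting. Note also that systemctl -H connects to the peer over SSH, so this approach assumes non-interactive SSH access between the nodes.</p>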

<br>

Thanks,<br>

Chris<br>

<blockquote type="cite" cite="mid:CAA91j0XLHTUG44ew7CN=ebtaOa2H5bQDYoNgDSe-yiuHWbF23Q@mail.gmail.com">

<pre class="moz-quote-pre" wrap="">

</pre>

<blockquote type="cite">

<pre class="moz-quote-pre" wrap="">but the 20s default of that in current Pacemaker is most likely

shorter than what you had as initdead in your "old" setup.

So maybe if you set dc-deadtime to two minutes or something,

that would give you the "expected" behavior?

</pre>

</blockquote>

<pre class="moz-quote-pre" wrap="">

If you call two isolated single node clusters running the same

applications likely using the same shared resources "expected", just

set startup-fencing=false, but then do not complain about data

corruption.


</pre>

</blockquote>

<p><br>

</p>

</body>

</html>