[ClusterLabs] [EXTERNAL] Re: "node is unclean" leads to gratuitous reboot

Andrei Borzenkov arvidjaar at gmail.com
Thu Jul 11 06:52:15 EDT 2019


On Thu, Jul 11, 2019 at 12:58 PM Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:
>
> On Wed, Jul 10, 2019 at 06:15:56PM +0000, Michael Powell wrote:
> > Thanks to you and Andrei for your responses.  In our particular
> > situation, we want to be able to operate with either node in
> > stand-alone mode, or with both nodes protected by HA.  I did not
> > mention this, but I am working on upgrading our product
> > from a version which used Pacemaker version 1.0.13 and Heartbeat
> > to run under CentOS 7.6 (later 8.0).
> > The older version did not exhibit this behavior, hence my concern.
>
> Heartbeat by default has much less aggressive timeout settings,
> and clearly distinguishes between "deadtime", and "initdead",
> basically a "wait_for_all" with timeout: how long to wait for other
> nodes during startup before declaring them dead and proceeding in
> the startup sequence, ultimately fencing unseen nodes anyways.
>
> Pacemaker itself has "dc-deadtime", documented as
> "How long to wait for a response from other nodes during startup.",

The documentation is incomplete: it is the timeout before starting a DC
(re-)election, so it also applies when the current DC fails, and raising
it will delay recovery in that case.

At least that is how I understand it :)

> but the 20s default of that in current Pacemaker is much likely
> shorter than what you had as initdead in your "old" setup.
>
> So maybe if you set dc-deadtime to two minutes or something,
> that would give you the "expected" behavior?
>
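For reference, Lars's suggestion could be applied like this (a sketch; dc-deadtime is a real Pacemaker cluster property with a 20s default, but verify the exact syntax against your pcs/crm_attribute version):

```shell
# Raise dc-deadtime to two minutes, so a starting node waits
# longer for its peer before electing itself DC:
pcs property set dc-deadtime=2min

# Equivalent with the lower-level tool:
crm_attribute --type crm_config --name dc-deadtime --update 2min
```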

If you call two isolated single-node clusters, running the same
applications and likely using the same shared resources, "expected"
behavior, then just set startup-fencing=false, but do not complain
about data corruption afterwards.
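For completeness, the setting warned against above would look like this (a sketch; startup-fencing is a real Pacemaker cluster property, default true):

```shell
# DANGEROUS with shared storage: do not fence nodes that have
# never been seen since this node started. Two isolated nodes
# can then each start the same resources and corrupt shared data.
pcs property set startup-fencing=false
```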
