[ClusterLabs] Master/slave failover does not work as expected

Tue Aug 20 14:55:47 EDT 2019

On 15/08/19 17:03 +0000, Michael Powell wrote:
> First, thanks to all for their responses.  With your help, I'm
> steadily gaining competence WRT HA, albeit slowly.
> 
> I've basically followed Harvey's workaround suggestion, and the
> failover I hoped for takes effect quite quickly.  I nevertheless
> remain puzzled about why our legacy code, based upon Pacemaker
> 1.0/Heartbeat, works satisfactorily w/o such changes.
> 
> Here's what I've done.  First, based upon responses to my post, I've
> implemented the following commands when setting up the cluster:
>> crm_resource  -r SS16201289RN00023 --meta -p resource-stickiness -v 0    # (someone asserted that this was unnecessary)
>> crm_resource  --meta -r SS16201289RN00023 -p migration-threshold -v 1
>> crm_resource  -r SS16201289RN00023 --meta -p failure-timeout -v 10
>> crm_resource  -r SS16201289RN00023 --meta -p cluster-recheck-interval -v 15
> 
> In addition, I've added "-l reboot" to those instances where
> 'crm_master' is invoked by the RA to change resource scores.  I also
> found a location constraint in our setup that I couldn't understand
> the need for, and removed it.
> 
> After doing this, in my initial tests, I found that after 'kill -9
> <pid>' was issued to the master, the slave instance on the other
> node was promoted to master within a few seconds.  However, it took
> 60 seconds before the killed resource was restarted.  In examining
> the cib.xml file, I found an "rsc-options-failure-timeout", which
> was set to "1min".  Thinking "aha!", I added the following line
>> crm_attribute --type rsc_defaults --name failure-timeout --update 15
> 
> Sadly, this does not appear to have had any impact.

It won't have any impact on your SS16201289RN00023 resource, since that
resource has its own failure-timeout, set to 10 seconds per the commands
above.  The per-resource value is the closer one, so it overrides the
rsc_defaults value, which is only an outermost fallback (and the built-in
default applies only when neither is set explicitly).

Admittedly, the documentation could use a precise formalization
of the precedence rules.
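
To illustrate the precedence just described, here is a minimal sketch in
Python.  It is a toy model, not Pacemaker code: the function name
effective_meta and the dictionaries are made up for illustration, and the
built-in value of 0 (failures never expire) is my understanding of the
default rather than something taken from this thread.

```python
# Toy model of meta-attribute precedence; not Pacemaker code.
# Assumption: the built-in failure-timeout is 0, i.e. failures never expire.
BUILTIN_DEFAULTS = {"failure-timeout": 0}


def effective_meta(name, resource_meta, rsc_defaults, builtin=BUILTIN_DEFAULTS):
    """Return (value, source) for a meta attribute, most specific source first."""
    for source, values in (
        ("resource", resource_meta),      # set with crm_resource --meta
        ("rsc_defaults", rsc_defaults),   # set with crm_attribute --type rsc_defaults
        ("built-in", builtin),            # used only when nothing is set explicitly
    ):
        if name in values:
            return values[name], source
    raise KeyError(name)


# Values from this thread: 10 on the resource itself, 15 in rsc_defaults.
print(effective_meta("failure-timeout", {"failure-timeout": 10},
                     {"failure-timeout": 15}))
```

With the per-resource value present, the rsc_defaults entry is never
consulted, which is why changing it showed no effect.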

Anyway, this whole question seems moot to me; see below.

> So, while the good news is that the failover occurs as I'd hoped
> for, the time required to restart the failed resource seems
> excessive.  Apparently setting the failure-timeout values isn't
> sufficient.  As an experiment, I issued a "crm_resource -r
> <resourceid> --cleanup" command shortly after the failover took
> effect and found that the resource was quickly restarted.  Is that
> the recommended procedure?  If so, is changing the "failure-timeout"
> and "cluster-recheck-interval" really necessary?
> 
> Finally, for a period of about 20 seconds while the resource is
> being restarted (it takes about 30s to start up the resource's
> application), it appears that both nodes are masters.  E.g. here's
> the 'crm_mon' output.
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> [root@mgraid-16201289RN00023-0 bin]# date;crm_mon -1
> Thu Aug 15 09:44:59 PDT 2019
> Stack: corosync
> Current DC: mgraid-16201289RN00023-0 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
> Last updated: Thu Aug 15 09:44:59 2019
> Last change: Thu Aug 15 06:26:39 2019 by root via crm_attribute on mgraid-16201289RN00023-0
> 
> 2 nodes configured
> 4 resources configured
> 
> Online: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
> 
> Active resources:
> 
>  Clone Set: mgraid-stonith-clone [mgraid-stonith]
>      Started: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>  Master/Slave Set: ms-SS16201289RN00023 [SS16201289RN00023]
>      Masters: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This seems odd.  Eventually the just-started resource reverts to a slave, but this doesn't make sense to me.
> 
> For the curious, I've attached the 'cibadmin --query' output from
> the DC node, taken just prior to issuing the 'kill -9' command, as
> well as the corosync.log file from the DC following the 'kill -9'
> command.

Thanks for attaching both; with them, the situation you describe above
starts to make a bit more sense to me.

I think you've observed (at least that's what I see in the
provided log) the resource role continually flapping due to
a non-intermittent/deterministic obstacle on one of the nodes:

  nodes for simplicity: A, B
  only the M/S resource in question is considered
  notation: [node, state of the resource at hand]

  1. [A, master], [B, slave]

  2. for some reason, A fails (spotted with monitor)

  3. the cluster attempts to demote A, which, likely for the same
     reason, also fails, while B gets promoted; note that the agent
     itself notices something is pretty fishy there(!):
     > WARNING: ss_demote() trying to demote a resource that was
     > not started

  4. [A, stopped], [B, master]

  5. looks like at this point there's a sort of deadlock,
     since the missing instance of the resource cannot be
     started anywhere(?)

  6. failure-timeout of 10 seconds expires on A, hence
     that missing resource instance gets started on A again

  7. [A, slave], [B, master]

  8. likely due to score assignments, A gets promoted again

  9. goto 1.
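
The loop above can be sketched as a tiny simulation.  This is only a toy
model of the role transitions as I read them from the log, not anything
Pacemaker runs; the function simulate and its parameters are hypothetical,
and it assumes the fault only ever hits node A.

```python
# Toy model of the role loop described above; not Pacemaker code.
# Assumptions (mine): the fault only ever hits node A, and faults=None
# stands for a persistent (deterministic) fault.

def simulate(max_rounds, faults=None):
    """Return the sequence of (role of A, role of B) pairs that is observed."""
    a, b = "master", "slave"
    history = [(a, b)]
    for _ in range(max_rounds):
        if faults is not None:
            if faults == 0:
                break                   # fault is gone, roles stay put
            faults -= 1
        a, b = "stopped", "master"      # A fails, demote fails, B is promoted
        history.append((a, b))
        a = "slave"                     # failure-timeout expires, A starts again
        history.append((a, b))
        a, b = "master", "slave"        # scores favour A, so A is promoted back
        history.append((a, b))
    return history


print(len(simulate(3)))                 # persistent fault: every round repeats
print(simulate(3, faults=1)[-1])        # intermittent fault: settles after one round
```

With a persistent fault, the history grows by the same three transitions in
every round; with a fault that occurs once, the roles settle after a single
pass.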

The main problem is that by setting failure-timeout, you make
a contract with the cluster stack: you assert that you've taken
precautions to guarantee that a fault on a particular node is very
likely an intermittent one, not a systemic failure.  Only under that
assumption is it a valid decision to allow the resource back onto the
node after a while; common sense says it would be foolish otherwise.
(The exception is when the resource agent itself is the weakest link,
reacting badly to circumstances etc., but you should have accounted
for that in the first place, more so with a custom agent.)

When that's not the case, you'll get what you asked for: endlessly
unstable looping, caused by the excessive tolerance that
failure-timeout introduces.

Admittedly again, pacemaker could have at least two levels of
loop tracking within the resource life cycle, to possibly
detect such eventually futile attempts without progress (liveness
assurance).  Currently it has only a single tight loop tracking,
IIUIC, which appears not to be enough when it is further neutralized
by an explicit failure-timeout setting.
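
To make the idea of a second, outer level of tracking more concrete, here
is a rough sketch.  It is purely hypothetical: nothing like FlapGuard exists
in pacemaker as far as I know, and the names, the threshold, and the time
window are invented for illustration.

```python
# Hypothetical outer-loop guard; pacemaker has no such mechanism that I know of.
class FlapGuard:
    """Count failures that recur after an earlier failure has expired."""

    def __init__(self, max_relapses, window):
        self.max_relapses = max_relapses   # relapses tolerated within the window
        self.window = window               # seconds over which relapses are counted
        self.relapses = []                 # timestamps of recorded relapses

    def record_relapse(self, now):
        """Record a relapse; return True if retries look futile."""
        # Forget relapses that are older than the window.
        self.relapses = [t for t in self.relapses if now - t <= self.window]
        self.relapses.append(now)
        return len(self.relapses) > self.max_relapses


guard = FlapGuard(max_relapses=2, window=300)
for now in (0, 50, 100):
    print(now, guard.record_relapse(now))
```

The inner level would stay what it is today; the outer level would only
decide when expiring failures has stopped making progress and the node
should stay banned until someone cleans it up manually.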

(Private aside: any logging regarding notifications should only
be available under some log tag not enabled by default; there's
little value in logging that, especially when in-agent logging
is commonly present as well for any effective review of what
the agent actually obtained.)

-- 
Jan (Poki)