[ClusterLabs] Master/slave failover does not work as expected
Jan Pokorný
jpokorny at redhat.com
Wed Aug 21 08:48:37 EDT 2019
On 20/08/19 20:55 +0200, Jan Pokorný wrote:
> On 15/08/19 17:03 +0000, Michael Powell wrote:
>> First, thanks to all for their responses. With your help, I'm
>> steadily gaining competence WRT HA, albeit slowly.
>>
>> I've basically followed Harvey's workaround suggestion, and the
>> failover I hoped for takes effect quite quickly. I nevertheless
>> remain puzzled about why our legacy code, based upon Pacemaker
>> 1.0/Heartbeat, works satisfactorily w/o such changes.
>>
>> Here's what I've done. First, based upon responses to my post, I've
>> implemented the following commands when setting up the cluster:
>>> crm_resource -r SS16201289RN00023 --meta -p resource-stickiness -v 0 # (someone asserted that this was unnecessary)
>>> crm_resource --meta -r SS16201289RN00023 -p migration-threshold -v 1
>>> crm_resource -r SS16201289RN00023 --meta -p failure-timeout -v 10
>>> crm_resource -r SS16201289RN00023 --meta -p cluster-recheck-interval -v 15
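(Side note: cluster-recheck-interval is a cluster-wide property, not
a resource meta-attribute, so the last command above presumably had
no effect. A sketch of how it would be set for real, as a cluster
property via crm_attribute:

  crm_attribute --name cluster-recheck-interval --update 15s

This matters because an expired failure-timeout is only acted upon
when the cluster rechecks, so this interval bounds how long an
already-expired failure can linger.)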
>>
>> In addition, I've added "-l reboot" to those instances where
>> 'crm_master' is invoked by the RA to change resource scores. I also
>> found a location constraint in our setup that I couldn't understand
>> the need for, and removed it.
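(For context, a minimal sketch of what the -l/--lifetime change looks
like inside an agent; the score value and the places it is called from
are hypothetical, purely to illustrate the flag:

  # master preference is now bound to node reboot, i.e. it does
  # not survive a node restart
  crm_master -l reboot -v 100   # e.g. on a successful monitor
  ...
  crm_master -l reboot -D       # e.g. on stop, drop the preference

)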
>>
>> After doing this, in my initial tests, I found that after 'kill -9
>> <pid>' was issued to the master, the slave instance on the other
>> node was promoted to master within a few seconds. However, it took
>> 60 seconds before the killed resource was restarted. In examining
>> the cib.xml file, I found an "rsc-options-failure-timeout", which
>> was set to "1min". Thinking "aha!", I added the following line
>>> crm_attribute --type rsc_defaults --name failure-timeout --update 15
>>
>> Sadly, this does not appear to have had any impact.
>
> It won't have any impact on your SS16201289RN00023 resource, since
> that resource has its own failure-timeout set to 10 seconds per
> above. A value set directly on the resource is closer and hence
> overrides the outermost rsc_defaults fallback (which itself only
> overrides the built-in default when set explicitly).
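(To see which value actually applies to the resource, one can query
the meta-attribute directly, e.g.:

  crm_resource -r SS16201289RN00023 --meta -g failure-timeout

which should report the closer, per-resource value of 10.)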
>
> Admittedly, the documentation would benefit from a precise
> formalization of these precedence rules.
>
> Anyway, this whole point seems moot to me; see below.
>
>> So, while the good news is that the failover occurs as I'd hoped
>> for, the time required to restart the failed resource seems
>> excessive. Apparently setting the failure-timeout values isn't
>> sufficient. As an experiment, I issued a "crm_resource -r
>> <resourceid> --cleanup" command shortly after the failover took
>> effect and found that the resource was quickly restarted. Is that
>> the recommended procedure? If so, is changing the "failure-timeout"
>> and "cluster-recheck-interval" really necessary?
>>
>> Finally, for a period of about 20 seconds while the resource is
>> being restarted (it takes about 30s to start up the resource's
>> application), it appears that both nodes are masters. E.g. here's
>> the 'crm_mon' output.
>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>> [root@mgraid-16201289RN00023-0 bin]# date;crm_mon -1
>> Thu Aug 15 09:44:59 PDT 2019
>> Stack: corosync
>> Current DC: mgraid-16201289RN00023-0 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
>> Last updated: Thu Aug 15 09:44:59 2019
>> Last change: Thu Aug 15 06:26:39 2019 by root via crm_attribute on mgraid-16201289RN00023-0
>>
>> 2 nodes configured
>> 4 resources configured
>>
>> Online: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>
>> Active resources:
>>
>> Clone Set: mgraid-stonith-clone [mgraid-stonith]
>> Started: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>> Master/Slave Set: ms-SS16201289RN00023 [SS16201289RN00023]
>> Masters: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> This seems odd. Eventually the just-started resource reverts to a slave, but this doesn't make sense to me.
>>
>> For the curious, I've attached the 'cibadmin --query' output from
>> the DC node, taken just prior to issuing the 'kill -9' command, as
>> well as the corosync.log file from the DC following the 'kill -9'
>> command.
>
> Thanks for attaching both; with these, your described situation
> starts to make a bit more sense to me.
>
> I think you've observed (at least that's what I see per the
> provided log) the resource role continually flapping due to
> a non-intermittent, deterministic obstacle on one of the nodes:
>
> nodes for simplicity: A, B
> only the M/S resource in question is considered
> notation: [node, state of the resource at hand]
>
> 1. [A, master], [B, slave]
>
> 2. for some reason, A fails (spotted with monitor)
>
> 3. attempt demoting A; likely for the same reason, it also fails,
> note that the agent itself notices something is pretty fishy
> there(!):
> > WARNING: ss_demote() trying to demote a resource that was
> > not started
> while B gets promoted
Note that the agent's monitor operation shall also likely fail
with OCF_FAILED_MASTER rather than OCF_NOT_RUNNING when the master
role is assumed (even though pacemaker's response appears to be the
same as documented for the former, i.e. demoting first, nonetheless).
Yet again, it's admittedly terribly underspecified which exact
responses are expected at which stage of the resource instance's
life cycle, also owing to the fact that this is originally a
"proprietary" extension over plain OCF.
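To illustrate the distinction, a minimal sketch of such monitor
logic (the ss_* helpers are hypothetical stand-ins for the agent's
actual checks; the OCF_* return codes are the standard ones):

  ss_monitor() {
      if ! ss_app_is_running; then
          # this instance was promoted: report a failed master,
          # not a plain "not running"
          if ss_was_promoted; then
              return $OCF_FAILED_MASTER   # rc 9
          fi
          return $OCF_NOT_RUNNING         # rc 7
      fi
      if ss_was_promoted; then
          return $OCF_RUNNING_MASTER      # rc 8
      fi
      return $OCF_SUCCESS                 # rc 0
  }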
> 4. [A, stopped], [B, master]
>
> 5. looks like at this point, there's a sort of deadlock,
> since the missing instance of the resource cannot be
> started anywhere(?)
>
> 6. failure-timeout of 10 seconds expires on A, hence
> that missing resource instance gets started on A again
>
> 7. [A slave, B master]
>
> 8. likely due to score assignments, A gets promoted again
>
> 9. goto 1.
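As a stop-gap while the obstacle on the faulty node is being
investigated, such a loop can also be broken by hand, e.g. (a sketch;
--ban inserts a -INFINITY location constraint that --clear removes
again):

  crm_resource -r ms-SS16201289RN00023 --ban -N mgraid-16201289RN00023-0
  ...
  crm_resource -r ms-SS16201289RN00023 --clear -N mgraid-16201289RN00023-0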
>
> The main problem is that with setting failure-timeout, you make
> a contract with the cluster stack that you've taken precautions
> to guarantee that the fault on the particular node is very
> very likely an intermittent one, not a systemic failure (unless
> the resource agent itself is the weakest link, badly reacting
> to circumstances etc., but you shall have accounted for that
> in the first place, more so with a custom agent), so it's a valid
> decision to allow that resource back after a while (per common
> sense, it would be foolish otherwise).
>
> When that's not the case, you'll get what you asked for: forever
> unstable looping, due to the excessive tolerance granted by
> failure-timeout.
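If such a guarantee cannot be made, the safer contract is arguably
not to set failure-timeout on this resource at all, recovering
manually once the actual problem is understood; a sketch using the
commands already mentioned above:

  crm_resource -r SS16201289RN00023 --meta -d failure-timeout
  ...
  crm_resource -r SS16201289RN00023 --cleanup   # after the fix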
>
> Admittedly again, pacemaker could have at least two levels of
> loop tracking within the resource life cycle, so as to detect
> such eventually futile attempts without progress (liveness
> assurance). Currently it has only a single tight-loop tracking
> IIUIC, which appears not enough when further neutralized by an
> explicitly set failure-timeout.
>
> (Private aside: any logging regarding notifications shall only
> be available under some log tag not enabled by default; there's
> little value in logging that, especially when in-agent logging
> is commonly present as well for an effective review of what
> the agent actually received.)
--
Jan (Poki)