[ClusterLabs] Master/slave failover does not work as expected

Jan Pokorný jpokorny at redhat.com
Wed Aug 21 10:48:50 EDT 2019

On 21/08/19 14:48 +0200, Jan Pokorný wrote:
> On 20/08/19 20:55 +0200, Jan Pokorný wrote:
>> On 15/08/19 17:03 +0000, Michael Powell wrote:
>>> First, thanks to all for their responses.  With your help, I'm
>>> steadily gaining competence WRT HA, albeit slowly.
>>> I've basically followed Harvey's workaround suggestion, and the
>>> failover I hoped for takes effect quite quickly.  I nevertheless
>>> remain puzzled about why our legacy code, based upon Pacemaker
>>> 1.0/Heartbeat, works satisfactorily w/o such changes.
>>> Here's what I've done.  First, based upon responses to my post, I've
>>> implemented the following commands when setting up the cluster:
>>>> crm_resource  -r SS16201289RN00023 --meta -p resource-stickiness -v 0    # (someone asserted that this was unnecessary)
>>>> crm_resource  --meta -r SS16201289RN00023 -p migration-threshold -v 1
>>>> crm_resource  -r SS16201289RN00023 --meta -p failure-timeout -v 10
>>>> crm_resource  -r SS16201289RN00023 --meta -p cluster-recheck-interval -v 15
>>> In addition, I've added "-l reboot" to those instances where
>>> 'crm_master' is invoked by the RA to change resource scores.  I also
>>> found a location constraint in our setup that I couldn't understand
>>> the need for, and removed it.
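(For context, the `-l reboot` part makes the master preference a transient
node attribute, i.e. one that is dropped when the node restarts rather than
persisted.  Inside the agent this typically looks something like the sketch
below -- the score value 100 is arbitrary, pick what fits your scoring
scheme:)

```shell
# in the agent's start/monitor path: prefer promotion on this node,
# with "reboot" (transient) lifetime so the score dies with the node
crm_master -l reboot -v 100

# in the demote/stop path: drop the preference again
crm_master -l reboot -D
```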
>>> After doing this, in my initial tests, I found that after 'kill -9
>>> <pid>' was issued to the master, the slave instance on the other
>>> node was promoted to master within a few seconds.  However, it took
>>> 60 seconds before the killed resource was restarted.  In examining
>>> the cib.xml file, I found an "rsc-options-failure-timeout", which
>>> was set to "1min".  Thinking "aha!", I added the following line
>>>> crm_attribute --type rsc_defaults --name failure-timeout --update 15
>>> Sadly, this does not appear to have had any impact.
>> It won't have any impact on your SS16201289RN00023 resource, since
>> that resource has its own failure-timeout set to 10 seconds per the
>> above -- a closer value, hence overriding such an outermost fallback
>> (barring deferral to the built-in default when no explicit value is
>> set anywhere).
>> Admittedly, the documentation could use a precise formalization
>> of the precedence rules.
>> Anyway, this whole point seems moot to me, see below.
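The lookup order is, roughly: explicit meta attribute on the resource,
then the rsc_defaults fallback, then the built-in default.  So with both
of the following set (as in your configuration), the resource-level value
wins:

```shell
# cluster-wide fallback for any resource lacking its own value:
crm_attribute --type rsc_defaults --name failure-timeout --update 15

# resource-level meta attribute -- this one takes precedence:
crm_resource -r SS16201289RN00023 --meta -p failure-timeout -v 10
```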
>>> So, while the good news is that the failover occurs as I'd hoped
>>> for, the time required to restart the failed resource seems
>>> excessive.  Apparently setting the failure-timeout values isn't
>>> sufficient.  As an experiment, I issued a "crm_resource -r
>>> <resourceid> --cleanup" command shortly after the failover took
>>> effect and found that the resource was quickly restarted.  Is that
>>> the recommended procedure?  If so, is changing the "failure-timeout"
>>> and "cluster-recheck-interval" really necessary?
>>> Finally, for a period of about 20 seconds while the resource is
>>> being restarted (it takes about 30s to start up the resource's
>>> application), it appears that both nodes are masters.  E.g. here's
>>> the 'crm_mon' output.
>>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>> [root at mgraid-16201289RN00023-0 bin]# date;crm_mon -1
>>> Thu Aug 15 09:44:59 PDT 2019
>>> Stack: corosync
>>> Current DC: mgraid-16201289RN00023-0 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
>>> Last updated: Thu Aug 15 09:44:59 2019
>>> Last change: Thu Aug 15 06:26:39 2019 by root via crm_attribute on mgraid-16201289RN00023-0
>>> 2 nodes configured
>>> 4 resources configured
>>> Online: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>> Active resources:
>>>  Clone Set: mgraid-stonith-clone [mgraid-stonith]
>>>      Started: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>>  Master/Slave Set: ms-SS16201289RN00023 [SS16201289RN00023]
>>>      Masters: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> This seems odd.  Eventually the just-started resource reverts to a slave, but this doesn't make sense to me.
>>> For the curious, I've attached the 'cibadmin --query' output from
>>> the DC node, taken just prior to issuing the 'kill -9' command, as
>>> well as the corosync.log file from the DC following the 'kill -9'
>>> command.
>> Thanks for attaching both, since your above described situation starts
>> to make a bit more sense to me.
>> I think you've observed (at least that's what I see in the
>> provided log) the resource role continually flapping due to
>> a non-intermittent/deterministic obstacle on one of the nodes:
>>   nodes for simplicity: A, B
>>   only the M/S resource in question is considered
>>   notation: [node, state of the resource at hand]
>>   1. [A, master], [B, slave]
>>   2. for some reason, A fails (spotted with monitor)
>>   3. attempt demoting A, for likely the same reason, it also fails,
>>      note that the agent itself notices something is pretty fishy
>>      there(!):
>>      > WARNING: ss_demote() trying to demote a resource that was
>>      > not started
>>      while B gets promoted
> Note that the agent's monitor operation should also likely fail
> with OCF_FAILED_MASTER rather than OCF_NOT_RUNNING when the master
> role is assumed (even though pacemaker's response appears the same
> as documented for the former, with demoting first, nonetheless).
> Yet again, it's admittedly terribly underspecified which exact
> responses are expected at which state of the resource instance's
> life cycle, also owing to the fact that this is originally
> a "proprietary" extension over plain OCF.

Sorry, I forgot that -- counterintuitively -- the details live on
the side of the API convention's dependants (rather than of the
convention's exerciser):


Apparently, in this case, neither "gracefully shut down" nor "has
never been started" applies (if I am not missing anything, that is),
so returning OCF_NOT_RUNNING was inappropriate.  I would make sure
the agent conforms to all these rules before investigating further.
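To illustrate the expected mapping, here is a minimal sketch of a monitor
function for a promotable (master/slave) agent.  The `ss_monitor` name and
the STATE argument are hypothetical stand-ins for whatever probing the real
agent does (process checks, pidfile, application health); only the exit
codes themselves are standard OCF:

```shell
#!/bin/sh
# Standard OCF exit codes relevant to a promotable resource's monitor:
OCF_SUCCESS=0          # running (in slave role)
OCF_NOT_RUNNING=7      # gracefully stopped, or never started
OCF_RUNNING_MASTER=8   # running and holding the master role
OCF_FAILED_MASTER=9    # failed while holding the master role

# ss_monitor STATE -- hypothetical; STATE stands in for the agent's
# actual probing of the managed application
ss_monitor() {
    case "$1" in
        slave)         return $OCF_SUCCESS ;;
        master)        return $OCF_RUNNING_MASTER ;;
        # crashed while master: pacemaker demotes first, then recovers
        failed-master) return $OCF_FAILED_MASTER ;;
        # only when cleanly stopped or never started -- not after a crash
        *)             return $OCF_NOT_RUNNING ;;
    esac
}
```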

>>   4. [A, stopped], [B, master]
>>   5. looks like at this point, there's a sort of a deadlock,
>>      since the missing instance of the resource cannot be
>>      started anywhere(?)
>>   6. failure-timeout of 10 seconds expires on A, hence
>>      that missing resource instance gets started on A again
>>   7. [A, slave], [B, master]
>>   8. likely due to score assignments, A gets promoted again
>>   9. goto 1.
>> The main problem is that by setting failure-timeout, you make
>> a contract with the cluster stack that you've taken precautions
>> to guarantee that the fault on the particular node is very
>> likely an intermittent one, not a systemic failure (unless
>> the resource agent itself is the weakest link, reacting badly
>> to circumstances etc., but you should have accounted for that
>> in the first place, more so with a custom agent), so it's a valid
>> decision to allow that resource back after a while (using
>> common sense, it would be foolish otherwise).
>> When that's not the case, you get what you asked for: forever
>> unstable looping, due to the excessive tolerance granted by
>> the failure timeout.
>> Admittedly again, pacemaker could have at least two levels of
>> loop tracking within the resource life cycle, to possibly
>> detect such eventually futile attempts without progress (liveness
>> assurance).  Currently it has only a single tight-loop tracking
>> IIUIC, which appears not to be enough when further neutralized
>> with a non-implicit failure-timeout setting.
>> (Private aside: any logging regarding notifications should only
>> be available under some log tag not enabled by default; there's
>> little value in logging it, especially when in-agent logging
>> is commonly present as well for any effective review of what
>> the agent actually received).
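In other words, if the fault can be systemic, one way out of the flapping
(a sketch, not the only approach) is to drop the failure-timeout again so
a hard-failed instance stays put until an operator has verified the fix
and clears the failure record manually:

```shell
# remove the failure-timeout meta attribute set earlier, so a failed
# instance is not re-admitted automatically
crm_resource -r SS16201289RN00023 --meta -d failure-timeout

# ...then, only after the underlying fault has actually been fixed:
crm_resource -r SS16201289RN00023 --cleanup
```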

Jan (Poki)
