[ClusterLabs] Continuous master monitor failure of a resource in case some other resource is being promoted

Andrei Borzenkov arvidjaar at gmail.com
Mon Feb 25 22:55:51 EST 2019


26.02.2019 1:08, Ken Gaillot wrote:
> On Mon, 2019-02-25 at 23:00 +0300, Andrei Borzenkov wrote:
>> 25.02.2019 22:36, Andrei Borzenkov wrote:
>>>
>>>> Could you please help me understand:
>>>> 1. Why doesn't pacemaker process the failure of the Stateful_Test_2
>>>> resource immediately after the first failure?
>>
>> I'm still not sure why.
>>
>>> I vaguely remember something about sequential execution being
>>> mentioned before, but I cannot find the details.
>>>
>>>> 2. Why does the monitor failure of Stateful_Test_2 continue even
>>>> after the promote of Stateful_Test_1 has been completed? Shouldn't it
>>>> handle Stateful_Test_2's failure and take the necessary action on it?
>>>> It feels as if that particular failure 'event' has been 'dropped' and
>>>> pengine is not even aware of Stateful_Test_2's failure.
>>>>
>>>
>>> Yes. Although crm_mon shows the resource as being master on this
>>> node, in reality the resource is left in a failed state forever and
>>> the monitor result is simply ignored.
>>>
>>
>> Yes, pacemaker reacts only to result changes (more precisely, it tells
>> lrmd to report only the first result and suppress all further
>> consecutive duplicates). As the first report gets lost due to the low
>> failure-timeout, this explains what happens.
> 
> That would explain why the first monitor failure is ignored, but after
> the long-running action completes, a new transition should see the
> failure timeout, wipe the resource history, and log a message on the DC
> about "Re-initiated expired calculated failure", at which point the
> cluster should schedule recovery.
> 
> Do the logs show such a message?
> 

Yes, this message appears. I'm not sure how it changes anything, though,
because the problem is not that pacemaker does not react to subsequent
failures, but that after the first monitor failure lrmd does not report
anything to pacemaker at all.


crmd/lrm_state.c:lrm_state_exec()

    return ((lrmd_t *) lrm_state->conn)->cmds->exec(lrm_state->conn,
                                                    rsc_id,
                                                    action,
                                                    userdata,
                                                    interval,
                                                    timeout,
                                                    start_delay,
                                                    lrmd_opt_notify_changes_only,
                                                    params);


lrmd/lrmd.c:send_cmd_complete_notify()

    /* if the first notify result for a cmd has already been sent earlier,
     * and the option to only send notifies on result changes is set, check
     * to see if the last result is the same as the new one. If so, suppress
     * this update */
    if (cmd->first_notify_sent && (cmd->call_opts & lrmd_opt_notify_changes_only)) {
        if (cmd->last_notify_rc == cmd->exec_rc &&
            cmd->last_notify_op_status == cmd->lrmd_op_status) {

            /* only send changes */
            return;
        }
    }

From pacemaker's point of view there is no resource status change at all.
It never initiated a different operation for this resource, so the last rc
and op_status never change.
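
To illustrate the effect, here is a minimal standalone sketch (not
pacemaker source; the struct and function names are made up for
illustration) of the check quoted above: once the first result has been
delivered, every identical consecutive result of the recurring monitor is
dropped, so crmd never hears about the repeated failures.

    /* Illustrative sketch only: a recurring monitor that keeps returning
     * the same rc produces exactly one notification. */
    #include <stdbool.h>
    #include <stdio.h>

    struct cmd_state {
        bool first_notify_sent;
        int last_notify_rc;
        int last_notify_op_status;
    };

    /* Returns true if a notification would be sent for this result. */
    static bool would_notify(struct cmd_state *cmd, int exec_rc, int op_status)
    {
        if (cmd->first_notify_sent
            && cmd->last_notify_rc == exec_rc
            && cmd->last_notify_op_status == op_status) {
            return false;       /* same result as last time: suppressed */
        }
        cmd->first_notify_sent = true;
        cmd->last_notify_rc = exec_rc;
        cmd->last_notify_op_status = op_status;
        return true;
    }

    int main(void)
    {
        struct cmd_state monitor = { false, 0, 0 };
        int i;

        /* the monitor keeps returning the same failure rc */
        for (i = 0; i < 5; i++) {
            printf("monitor #%d: %s\n", i + 1,
                   would_notify(&monitor, 1, 0) ? "notified" : "suppressed");
        }
        return 0;
    }

Only the first failure gets reported; everything after that is silently
swallowed until a different operation, or a different result, changes the
stored rc/op_status.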


>> ...
>>
>>>>
>>>> Could you please help us in understanding this behavior and how
>>>> to fix this?
>>>>
>>>
>>> Your problem is triggered by a too-low failure-timeout. The failure
>>> of the master is cleared before pacemaker picks it up for processing
>>> (or so I interpret it). You should set failure-timeout to be longer
>>> than your actions may take. This will at least give you a workaround.
>>>
>>> Note that in your configuration the resource cannot be recovered
>>> anyway: migration-threshold is 1, so pacemaker cannot (try to)
>>> restart the master on the same node, but you prohibit running it
>>> anywhere else.
> 
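
For reference, the workaround in the quoted advice above would look
roughly like this in crm shell syntax. This is only a hypothetical
sketch: the agent (ocf:pacemaker:Stateful), the monitor intervals and the
failure-timeout / migration-threshold values are assumptions and need to
be tuned to how long your promote and monitor actions can actually take.

    # illustrative only -- agent, intervals and values are assumptions
    primitive Stateful_Test_2 ocf:pacemaker:Stateful \
        op monitor role=Master interval=10s timeout=30s \
        op monitor role=Slave interval=11s timeout=30s \
        meta migration-threshold=3 failure-timeout=600s
    ms ms_Stateful_Test_2 Stateful_Test_2 \
        meta master-max=1 clone-max=2

With failure-timeout comfortably longer than the longest action you
expect, the failure record survives until the scheduler actually gets to
process it, and a migration-threshold greater than 1 lets pacemaker at
least attempt a restart on the same node.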



