[ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

Wed Feb 16 12:48:50 EST 2022

On 16.02.2022 14:35, Lentes, Bernd wrote:
> 
> 
> ----- On Feb 16, 2022, at 12:52 AM, kgaillot kgaillot at redhat.com wrote:
> 
> 
>>> Any idea ?
>>> What is about that transition 128, which is aborted ?
>>
>> A transition is the set of actions that need to be taken in response to
>> current conditions. A transition is aborted any time conditions change
>> (here, the target-role being changed in the configuration), so that a
>> new set of actions can be calculated.
>>
>> Someone once defined a transition as an "action plan", and I'm tempted
>> to use that instead. Plus maybe replace "aborted" with "interrupted",
>> so then we'd have "Action plan interrupted" which is maybe a little
>> more understandable.
>>
>>>
>>> Transition 128 is finished:
>>> Feb 15 21:04:26 [15370] ha-idg-2       crmd:   notice:
>>> run_graph:       Transition 128 (Complete=1, Pending=0, Fired=0,
>>> Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-
>>> 3548.bz2): Complete
>>>
>>> And one second later the shutdown starts. Is that normal that there
>>> is such a big time gap ?
>>>
>>
>> No, there should be another transition calculated (with a "saving
>> input" message) immediately after the original transition is aborted.
>> What's the timestamp on that?
>> --
> 
> Hi Ken,
> 
> this is what i found:
> 
> Feb 15 20:54:30 [15369] ha-idg-2    pengine:   notice: process_pe_message:      Calculated transition 128, saving inputs in /var/lib/pacemaker/pengine/pe-input-3548.bz2
> Feb 15 20:54:30 [15370] ha-idg-2       crmd:     info: do_state_transition:     State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
> Feb 15 20:54:30 [15370] ha-idg-2       crmd:   notice: do_te_invoke:    Processing graph 128 (ref=pe_calc-dc-1644954870-403) derived from /var/lib/pacemaker/pengine/pe-input-3548.bz2
> Feb 15 20:54:30 [15370] ha-idg-2       crmd:   notice: te_rsc_command:  Initiating stop operation vm_pathway_stop_0 locally on ha-idg-2 | action 76
> 
> Feb 15 21:04:26 [15369] ha-idg-2    pengine:   notice: process_pe_message:      Calculated transition 129, saving inputs in /var/lib/pacemaker/pengine/pe-input-3549.bz2
> Feb 15 21:04:26 [15370] ha-idg-2       crmd:     info: do_state_transition:     State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
> Feb 15 21:04:26 [15370] ha-idg-2       crmd:   notice: do_te_invoke:    Processing graph 129 (ref=pe_calc-dc-1644955466-405) derived from /var/lib/pacemaker/pengine/pe-input-3549.bz2
> 

Splitting the logs across several messages makes them harder to interpret.

I guess the real question here is why "Transition aborted" is logged although
the transition apparently continues. Transition 128 started at 20:54:30 and
completed at 21:04:26, but there were multiple "Transition 128 aborted" messages
in between (unfortunately one now needs to hunt for another mail to put them
together).

It looks like "Transition aborted" really means "we try to abort this transition
if possible". My guess is that pacemaker must wait for the currently running
action(s) to finish, which can take quite some time when stopping a virtual
domain. Transition 128 was initiated to stop vm_pathway, but the logs shown here
do not tell us when the stop actually finished.
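
To make the gap easier to see without reading the timestamps by hand, here is a
rough sketch that pairs each "Calculated transition N" line with the matching
run_graph "Complete" line and prints how long the transition took. The sample
lines are the ones quoted in this thread (the run_graph line re-joined from the
wrapped quote). The function names and the year 2022 are my own assumptions
(the log timestamps carry no year), and the regular expressions only cover the
two message formats shown here, not every pacemaker version.

```python
import re
from datetime import datetime

# Sample lines copied from the logs quoted in this thread.
LOG = """\
Feb 15 20:54:30 [15369] ha-idg-2    pengine:   notice: process_pe_message:      Calculated transition 128, saving inputs in /var/lib/pacemaker/pengine/pe-input-3548.bz2
Feb 15 21:04:26 [15370] ha-idg-2       crmd:   notice: run_graph:       Transition 128 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3548.bz2): Complete
Feb 15 21:04:26 [15369] ha-idg-2    pengine:   notice: process_pe_message:      Calculated transition 129, saving inputs in /var/lib/pacemaker/pengine/pe-input-3549.bz2
"""

# Syslog-style timestamp at the start of each line, e.g. "Feb 15 20:54:30".
STAMP = r"(?P<stamp>\w{3} +\d+ \d{2}:\d{2}:\d{2})"
CALCULATED = re.compile(STAMP + r" .*Calculated transition (?P<id>\d+),")
COMPLETED = re.compile(
    STAMP + r" .*run_graph:\s+Transition (?P<id>\d+) \(.*\): Complete"
)


def parse_stamp(stamp, year=2022):
    """Parse a log timestamp; the year is an assumption, the log has none."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def transition_durations(text, year=2022):
    """Return {transition id: timedelta from 'Calculated' to 'Complete'}."""
    started = {}
    durations = {}
    for line in text.splitlines():
        m = CALCULATED.match(line)
        if m:
            started[int(m["id"])] = parse_stamp(m["stamp"], year)
            continue
        m = COMPLETED.match(line)
        if m and int(m["id"]) in started:
            tid = int(m["id"])
            durations[tid] = parse_stamp(m["stamp"], year) - started[tid]
    return durations


if __name__ == "__main__":
    for tid, took in sorted(transition_durations(LOG).items()):
        print(f"transition {tid}: {took}")
```

On the sample above this prints "transition 128: 0:09:56". Transition 129 is
not listed because its completion line is not among the quoted logs. Running
the same thing over the full log, together with the "aborted" messages, would
show whether the aborts all fall inside that window.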