[ClusterLabs] crm resource stop VirtualDomain - but VirtualDomain shutdown start some minutes later

Thu Feb 17 10:25:17 EST 2022

On Thu, 2022-02-17 at 14:05 +0100, Lentes, Bernd wrote:
> ----- On Feb 16, 2022, at 6:48 PM, arvidjaar arvidjaar at gmail.com
> wrote:
> > 
> > Splitting logs between different messages does not really help in
> > interpreting them.
> 
> I agree.
> Here is the complete excerpt from the respective time:
> https://nc-mcd.helmholtz-muenchen.de/nextcloud/s/eY8SA8pe4HZBBc8
> 
> > I guess the real question here is why "Transition aborted" is
> > logged although transition apparently continues. Transition 128
> > started at 20:54:30 and completed at 21:04:26, but there were
> > multiple "Transition 128 aborted" messages in between
> 
> That's correct. The shutdown_timeout for the domain is set to 600
> sec. in the CIB.
> The RA says:
> # The "shutdown_timeout" we use here is the operation
> # timeout specified in the CIB, minus 5 seconds
> And between 20:54:30 and 21:04:26 we have 596 sec., very close to
> those 595 sec.
> 
> > It looks like "Transition aborted" is more "we try to abort this
> > transition if possible". My guess is that pacemaker must wait for
> > currently running action(s) which can take quite some time when
> > stopping virtual domain. Transition 128 was initiated when stopping
> > vm_pathway, but we have no idea when it was stopped.
> 
> We have:
> Feb 15 21:04:26 [15370] ha-idg-2       crmd:   notice: run_graph:       Transition 128 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3548.bz2): Complete
> 
> and the log from libvirt confirms it:
> /var/log/libvirtd/qemu/vm_pathway.log:
> 2022-02-15T20:04:26.569471Z qemu-system-x86_64: terminating on signal 15 from pid 7368 (/usr/sbin/libvirtd)
> 2022-02-15 20:04:26.769+0000: shutting down, reason=destroyed
> 
> Time in the libvirt logs is UTC, and in Munich we are currently at
> UTC+1, so the timestamps in the two logs differ by one hour.
> We see that the domain is "switched off" via libvirt exactly at
> 21:04:26 local time.
> 
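
For anyone who wants to double-check the timing arithmetic quoted above,
here is a minimal Python sketch. It assumes a fixed UTC+1 offset (as stated
in the thread, no DST in February) and takes the timestamps straight from
the quoted logs; the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time is a fixed UTC+1, as stated in the thread.
LOCAL = timezone(timedelta(hours=1))

# Transition 128 start and completion, local time, from the pacemaker log.
start = datetime(2022, 2, 15, 20, 54, 30, tzinfo=LOCAL)
done = datetime(2022, 2, 15, 21, 4, 26, tzinfo=LOCAL)
elapsed = (done - start).total_seconds()

# The RA comment says: operation timeout from the CIB, minus 5 seconds.
op_timeout = 600
ra_shutdown_budget = op_timeout - 5

# libvirt logs in UTC; convert the "shutting down" timestamp to local time.
libvirt_utc = datetime.strptime(
    "2022-02-15 20:04:26.769+0000", "%Y-%m-%d %H:%M:%S.%f%z"
)
libvirt_local = libvirt_utc.astimezone(LOCAL)

print(elapsed)                      # 596.0
print(ra_shutdown_budget)           # 595
print(libvirt_local.isoformat())
```

The transition took 596 seconds against a 595 second shutdown budget, and
the converted libvirt timestamp falls in the same second as the
"Transition 128 ... Complete" message.
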
> So for me the big question is:
> When a transition is happening, and there is a change in the cluster,
> is the transition "aborted"
> (delayed or interrupted would be a better word) or not?
> Is this behaviour consistent? If not, what does it depend on?
> 
> Bernd

Yes, anytime the DC sees a change that could affect resources, it will
abort the current transition and calculate a new one. Aborting means
not initiating any new actions from the transition -- but any actions
currently in flight must complete before the new transition can be
calculated.
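
To illustrate what "aborted" means here, below is a toy Python model. It is
only a sketch of the behaviour described above, not Pacemaker's actual code;
the class, its methods, and the second action name are all made up for the
example.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    """Toy model of a transition; not Pacemaker's real data structure."""
    pending: list                      # actions not yet initiated
    in_flight: list = field(default_factory=list)
    completed: list = field(default_factory=list)
    aborted: bool = False

    def fire_next(self):
        # After an abort, no new actions are initiated.
        if self.aborted or not self.pending:
            return None
        action = self.pending.pop(0)
        self.in_flight.append(action)
        return action

    def abort(self):
        # Aborting only sets a flag; in-flight actions are left running.
        self.aborted = True

    def finish(self, action):
        self.in_flight.remove(action)
        self.completed.append(action)

    def ready_for_new_transition(self):
        # A new transition can be calculated only once nothing is in flight.
        return self.aborted and not self.in_flight


# "start vm_other" is a hypothetical second action, for illustration only.
t = Transition(pending=["stop vm_pathway", "start vm_other"])
first = t.fire_next()                   # the VM stop is now in flight
t.abort()                               # some cluster change arrives
blocked = t.fire_next()                 # None: nothing new is started
waiting = t.ready_for_new_transition()  # False: the stop is still running
t.finish(first)                         # the stop finally completes
ready = t.ready_for_new_transition()    # True: a new transition may follow
```

In this model a long-running stop keeps the aborted transition open until it
finishes, which matches repeated "Transition 128 aborted" messages followed
much later by "Complete".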

Changes that abort a transition include configuration changes, a node
joining or leaving, an unexpected action result being received, a node
attribute changing, the cluster-recheck-interval passing since the last
transition, or a timer popping for a time-based event (failure timeout,
rule, etc.). I may be forgetting some, but you get the idea.
-- 
Ken Gaillot <kgaillot at redhat.com>