[ClusterLabs] VirtualDomain - monitor misses to report & plays up
lejeczek
peljasz at yahoo.co.uk
Sun Apr 11 14:38:53 EDT 2021
Hi guys.
I'm experiencing weird "handling" of VirtualDomain by the
cluster. The cluster sometimes fails to report the real
state of a VM, which occasionally causes trouble - for
example, when the cluster thinks a VM is not running while
it actually is, it starts the VM on another node, which
corrupts the qcow image.
Right now, for instance, the cluster reports that a VM is up
& okay while it is not running on any node (because the VM
was powered off from inside the guest).
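For context, this is roughly how I cross-check the cluster's view
against libvirt's on each node - a quick sketch, assuming the
resource name matches the libvirt domain name (as it does with this
VirtualDomain config) and that passwordless ssh between nodes works:

```shell
# Compare what libvirt reports on each node with what pacemaker believes.
# Node and domain names are from my setup.
for node in swir dzien; do
    echo "== $node =="
    # libvirt's own view of the domain on that node
    ssh "$node" virsh -c qemu:///system domstate c8kubermaster1
done
# pacemaker's view of the resource
pcs status resources | grep c8kubermaster1
```

When the two disagree (libvirt says "shut off" everywhere, pacemaker
says "Started"), I'm in the situation described below.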
So I run:
-> $ pcs resource refresh c8kubermaster1
Cleaned up c8kubermaster1 on swir
Cleaned up c8kubermaster1 on dzien
Waiting for 2 replies from the controller
... got reply
... got reply (done)
In the logs on the node where the VM is supposedly running,
according to the cluster:
..
notice: Requesting local execution of probe operation for
c8kubermaster1 on swir
notice: Result of probe operation for c8kubermaster1 on
swir: ok
notice: Requesting local execution of monitor operation
for c8kubermaster1 on swir
notice: Result of monitor operation for c8kubermaster1 on
swir: ok
, while on the second node (it's a 2-node cluster) the logs show:
..
notice: State transition S_IDLE -> S_POLICY_ENGINE
notice: Ignoring expired c8kubernode1_migrate_to_0 failure
on dzien
notice: * Start c8kubermaster1 ( swir )
notice: Calculated transition 42, saving inputs in
/var/lib/pacemaker/pengine/pe-input-2655.bz2
notice: Initiating monitor operation
c8kubermaster1_monitor_0 on swir
notice: Initiating monitor operation
c8kubermaster1_monitor_0 locally on dzien
notice: Requesting local execution of probe operation for
c8kubermaster1 on dzien
notice: Result of probe operation for c8kubermaster1 on
dzien: not running
notice: Transition 42 aborted by operation
c8kubermaster1_monitor_0 'modify' on swir: Event failed
notice: Transition 42 action 11 (c8kubermaster1_monitor_0
on swir): expected 'not running' but got 'ok'
-> $ pcs resource config c8kubermaster1
Resource: c8kubermaster1 (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes:
config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml
hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true failure-timeout=120s
Operations: migrate_from interval=0s timeout=180s
(c8kubermaster1-migrate_from-interval-0s)
migrate_to interval=0s timeout=180s
(c8kubermaster1-migrate_to-interval-0s)
monitor interval=30s
(c8kubermaster1-monitor-interval-30s)
start interval=0s timeout=90s
(c8kubermaster1-start-interval-0s)
stop interval=0s timeout=90s
(c8kubermaster1-stop-interval-0s)
Disabling and then enabling the resource 'fixes' the glitch,
but naturally the obvious question is - why is that even
allowed to happen?
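For completeness, the disable/enable workaround I mean is just:

```shell
# Bounce the resource so the cluster re-probes and picks up
# the real state of the VM (stops it cluster-wide, then starts it
# where the policy engine places it).
pcs resource disable c8kubermaster1
pcs resource enable c8kubermaster1
```

It works every time, but it obviously causes downtime for the VM,
so it's a workaround, not a fix.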
many thanks, L.