[ClusterLabs] VirtualDomain - monitor misses to report & plays up
lejeczek
peljasz at yahoo.co.uk
Sun Apr 11 14:38:53 EDT 2021
Hi guys.
I'm experiencing weird "handling" of VirtualDomain by the
cluster. The cluster sometimes fails to report the real
state of a VM, which occasionally causes trouble - for
example, when the cluster thinks a VM is not running while
it actually is, it starts the VM on another node, which
corrupts the qcow image.
Right now, for instance, the cluster reports that a VM is up
& okay while it is not running on any node (because the VM
was powered off from inside the guest).
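For context, this is roughly how I cross-check the cluster's view
against libvirt's on each node - a quick sketch, assuming the
resource name matches the libvirt domain name (as it does with this
VirtualDomain config) and that passwordless ssh between nodes works:

```shell
# Compare what libvirt reports on each node with what pacemaker believes.
# Node and domain names are from my setup.
for node in swir dzien; do
    echo "== $node =="
    # libvirt's own view of the domain on that node
    ssh "$node" virsh -c qemu:///system domstate c8kubermaster1
done
# pacemaker's view of the resource
pcs status resources | grep c8kubermaster1
```

When the two disagree (libvirt says "shut off" everywhere, pacemaker
says "Started"), I'm in the situation described below.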
So I run:
-> $ pcs resource refresh c8kubermaster1
Cleaned up c8kubermaster1 on swir
Cleaned up c8kubermaster1 on dzien
Waiting for 2 replies from the controller
... got reply
... got reply (done)
In the logs on the node where the VM is supposedly running,
according to the cluster:
..
notice: Requesting local execution of probe operation for
c8kubermaster1 on swir
notice: Result of probe operation for c8kubermaster1 on
swir: ok
notice: Requesting local execution of monitor operation
for c8kubermaster1 on swir
notice: Result of monitor operation for c8kubermaster1 on
swir: ok
, while on the second node (it's a 2-node cluster) the logs show:
..
notice: State transition S_IDLE -> S_POLICY_ENGINE
notice: Ignoring expired c8kubernode1_migrate_to_0 failure
on dzien
notice: * Start c8kubermaster1 ( swir )
notice: Calculated transition 42, saving inputs in
/var/lib/pacemaker/pengine/pe-input-2655.bz2
notice: Initiating monitor operation
c8kubermaster1_monitor_0 on swir
notice: Initiating monitor operation
c8kubermaster1_monitor_0 locally on dzien
notice: Requesting local execution of probe operation for
c8kubermaster1 on dzien
notice: Result of probe operation for c8kubermaster1 on
dzien: not running
notice: Transition 42 aborted by operation
c8kubermaster1_monitor_0 'modify' on swir: Event failed
notice: Transition 42 action 11 (c8kubermaster1_monitor_0
on swir): expected 'not running' but got 'ok'
-> $ pcs resource config c8kubermaster1
Resource: c8kubermaster1 (class=ocf provider=heartbeat
type=VirtualDomain)
Attributes:
config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml
hypervisor=qemu:///system migration_transport=ssh
Meta Attrs: allow-migrate=true failure-timeout=120s
Operations: migrate_from interval=0s timeout=180s
(c8kubermaster1-migrate_from-interval-0s)
migrate_to interval=0s timeout=180s
(c8kubermaster1-migrate_to-interval-0s)
monitor interval=30s
(c8kubermaster1-monitor-interval-30s)
start interval=0s timeout=90s
(c8kubermaster1-start-interval-0s)
stop interval=0s timeout=90s
(c8kubermaster1-stop-interval-0s)
Disabling and then enabling the resource 'fixes' the glitch,
but naturally the obvious question is - why is that even
allowed to happen?
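For completeness, the disable/enable workaround I mean is just:

```shell
# Bounce the resource so the cluster re-probes and picks up
# the real state of the VM (stops it cluster-wide, then starts it
# where the policy engine places it).
pcs resource disable c8kubermaster1
pcs resource enable c8kubermaster1
```

It works every time, but it obviously causes downtime for the VM,
so it's a workaround, not a fix.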
many thanks, L.