[ClusterLabs] Antw: [EXT] VirtualDomain - monitor misses to report & plays up

Mon Apr 12 03:39:54 EDT 2021

>>> lejeczek <peljasz at yahoo.co.uk> wrote on 11.04.2021 at 20:38 in message
<ea3fa40a-1108-b441-8e70-b9dac12019c9 at yahoo.co.uk>:
> Hi guys.
> 
> I'm experiencing weird "handling" of VirtualDomain by the 
> cluster. It seems that the cluster sometimes fails to report 
> the real state of a VM, which sometimes causes trouble, like 
> when the cluster thinks a VM is not running while it is, and then 
> starts it on another node, which corrupts the qcow image.

Hi!

See my earlier messages on that topic as well. Usually those bad assumptions are consequences of other failures happening before.
I'm not saying that's an excuse, however.
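
To check on a given node whether libvirt and the cluster actually disagree about the VM, a minimal sketch along these lines might help. It assumes the libvirt domain name equals the resource name (it is really set in the domain XML), that `crm_node -n` returns the local node name, and that `crm_resource --locate` prints "is running on: NODE" for an active resource. The function name vm_state_check is made up for illustration.

```shell
# Hypothetical helper, not part of pcs or Pacemaker.
# Usage: vm_state_check RESOURCE [LIBVIRT_URI]   (run on the node to be checked)
vm_state_check() {
    res=$1
    uri=${2:-qemu:///system}

    # What libvirt says locally (e.g. "running", "shut off"); "unknown" if virsh fails.
    libvirt_state=$(virsh -c "$uri" domstate "$res" 2>/dev/null) || libvirt_state=unknown

    # What the cluster believes about this node. The output wording is an
    # assumption and may differ between Pacemaker versions.
    node=$(crm_node -n)
    if crm_resource --resource "$res" --locate 2>/dev/null \
            | grep -qw "is running on: $node"; then
        cluster_state=running
    else
        cluster_state=stopped
    fi

    echo "libvirt=$libvirt_state cluster=$cluster_state"

    # Return 0 when both views agree, 1 when they disagree.
    if [ "$libvirt_state" = running ] && [ "$cluster_state" = running ]; then
        return 0
    elif [ "$libvirt_state" != running ] && [ "$cluster_state" != running ]; then
        return 0
    fi
    return 1
}
```

Running `vm_state_check c8kubermaster1` on each node would then show where the two views diverge; a non-zero exit status marks a disagreement.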

Regards,
Ulrich

> Right now, for example, I'm looking at the cluster reporting the VM as up 
> and okay while it is not running on any of the nodes (because 
> 'poweroff' was run inside the VM).
> So I:
> 
> -> $ pcs resource refresh c8kubermaster1
> Cleaned up c8kubermaster1 on swir
> Cleaned up c8kubermaster1 on dzien
> Waiting for 2 replies from the controller
> ... got reply
> ... got reply (done)
> 
> In the logs of the node where the VM is supposed to be running, according to the cluster:
> ..
>   notice: Requesting local execution of probe operation for c8kubermaster1 on swir
>   notice: Result of probe operation for c8kubermaster1 on swir: ok
>   notice: Requesting local execution of monitor operation for c8kubermaster1 on swir
>   notice: Result of monitor operation for c8kubermaster1 on swir: ok
> 
> On the second node (2-node cluster), the logs show:
> ..
>   notice: State transition S_IDLE -> S_POLICY_ENGINE
>   notice: Ignoring expired c8kubernode1_migrate_to_0 failure on dzien
>   notice:  * Start      c8kubermaster1     (          swir )
>   notice: Calculated transition 42, saving inputs in /var/lib/pacemaker/pengine/pe-input-2655.bz2
>   notice: Initiating monitor operation c8kubermaster1_monitor_0 on swir
>   notice: Initiating monitor operation c8kubermaster1_monitor_0 locally on dzien
>   notice: Requesting local execution of probe operation for c8kubermaster1 on dzien
>   notice: Result of probe operation for c8kubermaster1 on dzien: not running
>   notice: Transition 42 aborted by operation c8kubermaster1_monitor_0 'modify' on swir: Event failed
>   notice: Transition 42 action 11 (c8kubermaster1_monitor_0 on swir): expected 'not running' but got 'ok'
> 
> -> $ pcs resource config c8kubermaster1
>   Resource: c8kubermaster1 (class=ocf provider=heartbeat type=VirtualDomain)
>    Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml hypervisor=qemu:///system migration_transport=ssh
>    Meta Attrs: allow-migrate=true failure-timeout=120s
>    Operations: migrate_from interval=0s timeout=180s (c8kubermaster1-migrate_from-interval-0s)
>                migrate_to interval=0s timeout=180s (c8kubermaster1-migrate_to-interval-0s)
>                monitor interval=30s (c8kubermaster1-monitor-interval-30s)
>                start interval=0s timeout=90s (c8kubermaster1-start-interval-0s)
>                stop interval=0s timeout=90s (c8kubermaster1-stop-interval-0s)
> 
> Disabling and re-enabling the resource 'fixes' the glitch but, 
> naturally, the obvious question is: why is that even 
> allowed to happen?
> many thanks, L.
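
The "expected ... but got ..." message in the quoted log is where the cluster notices that an operation result contradicts its own bookkeeping. A rough sketch for pulling such messages out of a log, assuming each message sits on a single line in the log file and keeps the wording shown in this thread; the function name probe_mismatches and the log path in the usage line are assumptions:

```shell
# Hypothetical helper: reads Pacemaker log text on stdin and prints one line
# per operation whose result differed from what the cluster expected.
probe_mismatches() {
    # Capture: operation name, node, expected result, actual result.
    sed -n "s/.*(\([^ ]*\) on \([^)]*\)): expected '\([^']*\)' but got '\([^']*\)'.*/\1 on \2: expected '\3', got '\4'/p"
}
```

Usage could look like `probe_mismatches < /var/log/pacemaker/pacemaker.log` (the path is an assumption; it depends on how logging is configured). Lines that do not match the pattern are dropped.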