[ClusterLabs] VirtualDomain - monitor misses to report & plays up

Mon Apr 12 01:17:49 EDT 2021

On 11.04.2021 21:38, lejeczek wrote:
> Hi guys.
> 
> I'm experiencing weird "handling" of VirtualDomain by the cluster. It
> seems that the cluster sometimes fails to report the real state of a VM,
> which sometimes causes trouble - like when the cluster thinks a VM is not
> running while it actually is, and then starts it on another node, which
> corrupts the qcow image.
> Right now, for example, I'm looking at the cluster reporting the VM as up
> and okay while it is not running on any of the nodes (because the VM was
> powered off from within itself).
> So I:
> 
> -> $ pcs resource refresh c8kubermaster1
> Cleaned up c8kubermaster1 on swir
> Cleaned up c8kubermaster1 on dzien
> Waiting for 2 replies from the controller
> ... got reply
> ... got reply (done)
> 
> In the logs on the node where the VM is supposed to be running, according
> to the cluster:
> ..
>  notice: Requesting local execution of probe operation for
> c8kubermaster1 on swir
>  notice: Result of probe operation for c8kubermaster1 on swir: ok
>  notice: Requesting local execution of monitor operation for
> c8kubermaster1 on swir
>  notice: Result of monitor operation for c8kubermaster1 on swir: ok
> 
> and on the second node (2-node cluster) the logs show:
> ..
>  notice: State transition S_IDLE -> S_POLICY_ENGINE
>  notice: Ignoring expired c8kubernode1_migrate_to_0 failure on dzien
>  notice:  * Start      c8kubermaster1     (          swir )
>  notice: Calculated transition 42, saving inputs in
> /var/lib/pacemaker/pengine/pe-input-2655.bz2
>  notice: Initiating monitor operation c8kubermaster1_monitor_0 on swir
>  notice: Initiating monitor operation c8kubermaster1_monitor_0 locally
> on dzien
>  notice: Requesting local execution of probe operation for
> c8kubermaster1 on dzien
>  notice: Result of probe operation for c8kubermaster1 on dzien: not running
>  notice: Transition 42 aborted by operation c8kubermaster1_monitor_0
> 'modify' on swir: Event failed
>  notice: Transition 42 action 11 (c8kubermaster1_monitor_0 on swir):
> expected 'not running' but got 'ok'
> 

You need to debug whether virsh returns correct information that is
then misinterpreted by the agent/pacemaker, or whether virsh itself
returns incorrect information. As far as I can tell, all that the
VirtualDomain monitor operation does is run "virsh domstate $DOMAIN".
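
To compare what virsh says with what a monitor would be expected to
report, something along these lines could be run on each node. This is
only a rough sketch: classify_domstate is a hypothetical helper of mine,
not code taken from the VirtualDomain agent, and the state-to-exit-code
mapping is a simplified assumption based on the usual OCF codes
(0 = running, 7 = not running, 1 = generic error). It also assumes the
libvirt domain has the same name as the resource, c8kubermaster1.

```shell
#!/bin/sh
# Hypothetical helper (NOT the actual VirtualDomain agent code):
# map a "virsh domstate" string to the OCF exit code a monitor
# would be expected to return for it.
#   0 = OCF_SUCCESS, 7 = OCF_NOT_RUNNING, 1 = OCF_ERR_GENERIC
classify_domstate() {
    case "$1" in
        running|paused|idle|blocked|"in shutdown")
            echo 0 ;;   # domain is considered active
        "shut off")
            echo 7 ;;   # domain is defined but not running
        *)
            echo 1 ;;   # empty or unexpected output: needs a closer look
    esac
}

# Usage on each node (domain name assumed to match the resource name):
#   state=$(virsh -c qemu:///system domstate c8kubermaster1)
#   echo "$(hostname): '$state' -> expected rc $(classify_domstate "$state")"
```

If virsh prints "shut off" on swir while the cluster still logs the
monitor result as "ok", the problem is on the agent/pacemaker side;
if virsh itself prints "running" for a VM that is powered off, the
problem is in libvirt.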

> -> $ pcs resource config c8kubermaster1
>  Resource: c8kubermaster1 (class=ocf provider=heartbeat type=VirtualDomain)
>   Attributes: config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml
> hypervisor=qemu:///system migration_transport=ssh
>   Meta Attrs: allow-migrate=true failure-timeout=120s
>   Operations: migrate_from interval=0s timeout=180s
> (c8kubermaster1-migrate_from-interval-0s)
>               migrate_to interval=0s timeout=180s
> (c8kubermaster1-migrate_to-interval-0s)
>               monitor interval=30s (c8kubermaster1-monitor-interval-30s)
>               start interval=0s timeout=90s
> (c8kubermaster1-start-interval-0s)
>               stop interval=0s timeout=90s
> (c8kubermaster1-stop-interval-0s)
> 
> Disabling and re-enabling the resource 'fixes' the glitch but, naturally,
> the obvious question is - why is that even allowed to happen?
> many thanks, L.