[ClusterLabs] VirtualDomain - monitor misses to report & plays up

Ken Gaillot kgaillot at redhat.com
Mon Apr 12 17:40:30 EDT 2021


On Sun, 2021-04-11 at 19:38 +0100, lejeczek wrote:
> Hi guys.
> 
> I'm experiencing weird "handling" of VirtualDomain by the 
> cluster. It seems the cluster sometimes fails to report 
> the real state of a VM, which at times leads to trouble - 
> e.g. when the cluster thinks a VM is not running while it 
> actually is, it starts the VM on another node, which 
> corrupts the qcow image.
> Right now, for example, the cluster reports the VM as up & 
> okay while it is not running on any node (because the VM 
> powered itself off).
> So I:
> 
> -> $ pcs resource refresh c8kubermaster1
> Cleaned up c8kubermaster1 on swir
> Cleaned up c8kubermaster1 on dzien
> Waiting for 2 replies from the controller
> ... got reply
> ... got reply (done)
> 
> In logs where VM is supposed to be running, according to cluster
> ..
>   notice: Requesting local execution of probe operation for 
> c8kubermaster1 on swir
>   notice: Result of probe operation for c8kubermaster1 on 
> swir: ok
>   notice: Requesting local execution of monitor operation 
> for c8kubermaster1 on swir
>   notice: Result of monitor operation for c8kubermaster1 on 
> swir: ok
> 
> , on the second node (2-node cluster) in logs:
> ..
>   notice: State transition S_IDLE -> S_POLICY_ENGINE
>   notice: Ignoring expired c8kubernode1_migrate_to_0 failure 
> on dzien
>   notice:  * Start      c8kubermaster1     (          swir )
>   notice: Calculated transition 42, saving inputs in 
> /var/lib/pacemaker/pengine/pe-input-2655.bz2
>   notice: Initiating monitor operation 
> c8kubermaster1_monitor_0 on swir
>   notice: Initiating monitor operation 
> c8kubermaster1_monitor_0 locally on dzien
>   notice: Requesting local execution of probe operation for 
> c8kubermaster1 on dzien
>   notice: Result of probe operation for c8kubermaster1 on 
> dzien: not running
>   notice: Transition 42 aborted by operation 
> c8kubermaster1_monitor_0 'modify' on swir: Event failed
>   notice: Transition 42 action 11 (c8kubermaster1_monitor_0 
> on swir): expected 'not running' but got 'ok'

Up to this point, everything is OK. When you clear history, Pacemaker
schedules all actions that would be needed if the affected resources
were stopped. If the probes find that they are indeed stopped, then
the rest of the actions can proceed. If a probe finds that a resource
is running, which is what happened just above, then the transition is
"aborted" and the scheduler is re-run.
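If you want to see exactly what the scheduler decided in that
transition and why, you can replay the saved input -- roughly like
this, assuming the pe-input file named in your log above is still
around:

  # replay transition 42's input and show the actions it scheduled
  crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-input-2655.bz2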

If the cluster started a second instance of the VM, it wasn't from the
above ... are there more logs where the start happens?
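Something like this on both nodes might turn it up (just a sketch;
the log location can vary by distro):

  # look for any start of the resource around that time
  grep -E 'c8kubermaster1.*(start|Initiating)' /var/log/pacemaker/pacemaker.log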

> 
> -> $ pcs resource config c8kubermaster1
>   Resource: c8kubermaster1 (class=ocf provider=heartbeat 
> type=VirtualDomain)
>    Attributes: 
> config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml 
> hypervisor=qemu:///system migration_transport=ssh
>    Meta Attrs: allow-migrate=true failure-timeout=120s
>    Operations: migrate_from interval=0s timeout=180s 
> (c8kubermaster1-migrate_from-interval-0s)
>                migrate_to interval=0s timeout=180s 
> (c8kubermaster1-migrate_to-interval-0s)
>                monitor interval=30s 
> (c8kubermaster1-monitor-interval-30s)
>                start interval=0s timeout=90s 
> (c8kubermaster1-start-interval-0s)
>                stop interval=0s timeout=90s 
> (c8kubermaster1-stop-interval-0s)
> 
> Disabling + enabling the resource 'fixes' the glitch, but 
> naturally the obvious question is - why is that even 
> allowed to happen?
> many thanks, L.
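
When the cluster reports the VM as 'ok' while it's actually down, the
first thing I'd check is what the agent's monitor action itself
returns on that node, since Pacemaker just acts on the agent's exit
code. You can run it by hand -- a rough sketch using the standard OCF
calling convention with the attribute values from your config (exit
code 0 means running, 7 means not running):

  # run the VirtualDomain monitor directly, outside the cluster
  OCF_ROOT=/usr/lib/ocf \
  OCF_RESKEY_config=/var/lib/pacemaker/conf.d/c8kubermaster1.xml \
  OCF_RESKEY_hypervisor=qemu:///system \
  /usr/lib/ocf/resource.d/heartbeat/VirtualDomain monitor; echo $?

If that returns 0 on a node where virsh shows the domain as shut off,
the problem is in the agent's status check rather than in Pacemaker
itself.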
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
-- 
Ken Gaillot <kgaillot at redhat.com>


