[ClusterLabs] Qemu VM resources - cannot acquire state change lock

lejeczek peljasz at yahoo.co.uk
Sat Aug 28 06:33:14 EDT 2021



On 26/08/2021 10:35, Klaus Wenninger wrote:
>
>
> On Thu, Aug 26, 2021 at 11:13 AM lejeczek via Users 
> <users at clusterlabs.org> wrote:
>
>     Hi guys.
>
>     I sometimes - and I think I can see a pattern as to
>     when -
>     get resources stuck on one node (two-node cluster), with
>     these in libvirtd's logs:
>     ...
>     Cannot start job (query, none, none) for domain
>     c8kubermaster1; current job is (modify, none, none)
>     owned by
>     (192261 qemuProcessReconnect, 0 <null>, 0 <null>
>     (flags=0x0)) for (1093s, 0s, 0s)
>     Cannot start job (query, none, none) for domain
>     ubuntu-tor;
>     current job is (modify, none, none) owned by (192263
>     qemuProcessReconnect, 0 <null>, 0 <null> (flags=0x0)) for
>     (1093s, 0s, 0s)
>     Timed out during operation: cannot acquire state
>     change lock
>     (held by monitor=qemuProcessReconnect)
>     Timed out during operation: cannot acquire state
>     change lock
>     (held by monitor=qemuProcessReconnect)
>     ...
>
>     when this happens, and if the resource is meant to be on
>     the other node, I have to disable the resource first;
>     the node on which the resource is stuck then shuts down
>     the VM,
>     and then I have to re-enable that resource so that,
>     only then,
>     it starts on that other, second node.
>
>     I think this problem occurs if I restart 'libvirtd'
>     via systemd.
>
>     Any thoughts on this guys?
>
>
> What are the logs on the pacemaker-side saying?
> An issue with migration?
>
> Klaus

I'll have to tidy up the "protocol" on my side before I 
could call it all reproducible; at the moment it only 
feels that way, as reproducible.

I'm on CentOS Stream with a 2-node cluster running KVM 
resources, backed by a 2-node glusterfs cluster on the same 
hosts (physically it is all just two machines).
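
For reference, the VMs are managed as the usual VirtualDomain 
resources, roughly along these lines (a sketch only - the config 
path and operation values here are illustrative, not copied from 
my cluster; 'c8kubermaster1' is just one of the domains from the 
logs above):

  pcs resource create c8kubermaster1 ocf:heartbeat:VirtualDomain \
      config=/etc/pacemaker/c8kubermaster1.xml \
      hypervisor="qemu:///system" \
      migration_transport=ssh \
      op monitor interval=30s timeout=90s \
      meta allow-migrate=true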

1) I power down one node in an orderly manner and the other 
node is last-man-standing.
2) after a while (not sure if the time period is also a key 
here) I bring that first node back up.
3) libvirtd on the last-man-standing node becomes unresponsive 
(I don't know yet if that happens only after the first node 
comes back up) to virsh commands and probably to everything 
else; the pacemaker log says:
...
pacemaker-controld[2730]:  error: Result of probe operation 
for c8kubernode2 on dzien: Timed Out
...
and the libvirtd log does not say anything useful (at default 
debug levels).
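
Next time I will try to raise libvirtd's debug level beforehand, 
with something along these lines in /etc/libvirt/libvirtd.conf 
(a sketch; filters and paths may need tweaking, and it has to be 
set before the test run, since restarting libvirtd is itself part 
of the problem):

  # /etc/libvirt/libvirtd.conf - more verbose qemu/rpc logging
  log_filters="1:qemu 1:libvirt 3:security 3:event"
  log_outputs="1:file:/var/log/libvirt/libvirtd.log"

  # restart the daemon so the new logging settings take effect
  systemctl restart libvirtd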

4) could glusterfs play any role here? Healing of the 
volume(s) has finished by this point and completed successfully.
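(For what it's worth, I check that with something like the 
following - 'vmstore' is just a placeholder for the actual 
volume name:)

  # check pending heals and brick/process status on the volume
  gluster volume heal vmstore info
  gluster volume status vmstore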

This is the moment where I would manually 'systemctl restart 
libvirtd' on that unresponsive node (the last-man-standing) and 
get the original error messages.
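
For completeness, the manual recovery I end up doing is roughly 
this (a sketch only - 'c8kubermaster1' stands in for whichever 
VM resource is stuck):

  # on the stuck (last-man-standing) node
  systemctl restart libvirtd

  # clear the failed/timed-out operations so pacemaker re-probes
  pcs resource cleanup c8kubermaster1

  # if the VM is meant to run on the other node:
  pcs resource disable c8kubermaster1
  # ...wait for the VM to shut down...
  pcs resource enable c8kubermaster1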

There is plenty of room for anybody to make guesses, obviously.
Is it 'libvirtd' going haywire because the glusterfs volume is 
in an unhealthy state and needs healing?
Is it the pacemaker last-man-standing situation that makes 
'libvirtd' go haywire?
etc...

I can't add much concrete stuff at this moment but will 
appreciate any thoughts you want to share.
thanks, L

>     many thanks, L.


