[ClusterLabs] Qemu VM resources - cannot acquire state change lock
lejeczek
peljasz at yahoo.co.uk
Sat Aug 28 06:33:14 EDT 2021
On 26/08/2021 10:35, Klaus Wenninger wrote:
>
>
> On Thu, Aug 26, 2021 at 11:13 AM lejeczek via Users
> <users at clusterlabs.org> wrote:
>
> Hi guys.
>
> I sometimes - I think I can see a pattern as to when -
> get resources stuck on one node (two-node cluster) with
> these in libvirtd's logs:
> ...
> Cannot start job (query, none, none) for domain c8kubermaster1;
> current job is (modify, none, none) owned by (192261
> qemuProcessReconnect, 0 <null>, 0 <null> (flags=0x0)) for
> (1093s, 0s, 0s)
> Cannot start job (query, none, none) for domain ubuntu-tor;
> current job is (modify, none, none) owned by (192263
> qemuProcessReconnect, 0 <null>, 0 <null> (flags=0x0)) for
> (1093s, 0s, 0s)
> Timed out during operation: cannot acquire state change lock
> (held by monitor=qemuProcessReconnect)
> Timed out during operation: cannot acquire state change lock
> (held by monitor=qemuProcessReconnect)
> ...
>
> when this happens, and if the resource is meant to run on
> the other node, I have to disable the resource first; the
> node on which the resource is stuck then shuts down the VM,
> and only after I re-enable that resource does it start on
> the other, second node.
>
> I think this problem occurs if I restart 'libvirtd'
> via systemd.
>
> Any thoughts on this guys?
>
>
> What are the logs on the pacemaker-side saying?
> An issue with migration?
>
> Klaus
I'll have to tidy up the "protocol" with my stuff before I can
call it all reproducible; at the moment it only feels that way.
I'm on CentOS Stream with a 2-node cluster running KVM
resources, backed by a 2-node glusterfs cluster on the same
hosts (physically it is all just two machines).
1) I power down one node in an orderly manner and the other
node is last-man-standing.
2) After a while (not sure if the time period is also a key
here) I bring that first node back up.
3) libvirtd on the last-man-standing node becomes unresponsive
(I don't know yet whether that happens only after the first
node comes back up) to virsh commands and probably everything
else; the pacemaker log says:
...
pacemaker-controld[2730]: error: Result of probe operation
for c8kubernode2 on dzien: Timed Out
...
and the libvirtd log does not say anything really (with
default debug levels).
4) Could glusterfs play any role? Healing of the volume(s) is
finished at this time, completed successfully.
This is the moment where I would manually 'systemctl restart
libvirtd' on that unresponsive node (the last-man-standing)
and get the original error messages.
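Next time it happens I'll try to confirm from the shell that
libvirtd itself is hung, and bump its debug logging before the
restart, so the restart that produces the "cannot acquire state
change lock" messages gets captured in detail. Something along
these lines (the log settings are just the standard libvirtd.conf
knobs, values taken from libvirt's logging docs):

    # does libvirtd still answer at all? give virsh a deadline
    # instead of letting it wait forever
    timeout 15 virsh -c qemu:///system list --all || echo "libvirtd not answering"

    # raise debug logging in /etc/libvirt/libvirtd.conf, then restart:
    #   log_filters="3:remote 4:event 3:util.json 3:rpc 1:*"
    #   log_outputs="1:file:/var/log/libvirt/libvirtd.log"
    systemctl restart libvirtd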
There is plenty of room for anybody to make guesses, obviously.
Is it 'libvirtd' going haywire because the glusterfs volume is
in an unhealthy state and needs healing?
Is it pacemaker's last-man-standing state which makes 'libvirtd'
go haywire?
etc...
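To rule the gluster angle in or out I suppose I should check the
heal state at the exact moment libvirtd stops answering, rather
than afterwards, e.g. (the volume name below is made up):

    gluster volume heal vmstore info
    gluster volume status vmstore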
I can't add much concrete stuff at this moment but will
appreciate any thoughts you want to share.
thanks, L
> many thanks, L.
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
>