[ClusterLabs] Antw: [EXT] VirtualDomain restart caused fencing.
Ulrich.Windl at rz.uni-regensburg.de
Thu Jul 1 02:19:08 EDT 2021
>>> Matthew Schumacher <matt.s at aptalaska.net> schrieb am 30.06.2021 um 17:40 in
Nachricht <50a4af9f-23c3-5913-1ffa-6503792da20a at aptalaska.net>:
> I'm not sure how to fix this, but calling 'crm resource restart vm-name'
> this morning caused an entire node to get fenced, kicking the stool out from
> under a number of VMs.
> Looking at VirtualDomain it looks like the system defaults to a 90s timeout,
> and if it can't gracefully shutdown the VM with 'virsh shutdown' in 85s, then
> it calls 'virsh destroy'. For whatever reason, that's not what happened.
> I created a mockup where I moved a test vm to its own node (in case it gets
> fenced), then loaded something that would ignore acpi shutdown, then called
> restart. This time it worked. The logs show:
> Jun 30 15:32:11 VirtualDomain(vm-testvm): INFO: Issuing graceful
> shutdown request for domain testvm.
> Jun 30 15:32:26 VirtualDomain(vm-testvm): INFO: Issuing forced
> shutdown (destroy) request for domain testvm.
> I don't have the logs from the original failure due to my node not being
> persistent, but I wonder if anyone else has run into this.
Typically it's a stop timeout. You should capture the logs!
In general, collecting the syslog of all cluster nodes on one node outside the cluster can be a valuable help when debugging cluster problems (admittedly we still don't have that here, but I'm working on it ;-)).
> Here is my resource configuration if that reveals the issue:
> crm configure primitive vm-testvm2 VirtualDomain params
> config="/datastore/vm/testvm/testvm.xml" migration_transport=ssh meta
> allow-migrate=true target-role=Started op monitor timeout=30 interval=30
A timeout on /datastore might trigger that as well.
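To rule out a stop timeout, one option is to give the stop operation an explicit timeout longer than VirtualDomain's internal graceful-shutdown window. A sketch based on the configuration quoted above (the 120s values are an assumption, not a tested recommendation; adjust to how long your VMs actually take to stop):

```shell
# Sketch: add explicit stop/start timeouts so pacemaker waits long enough
# for VirtualDomain to fall back from 'virsh shutdown' to 'virsh destroy'
# before declaring the stop failed (a failed stop escalates to fencing).
crm configure primitive vm-testvm2 VirtualDomain \
    params config="/datastore/vm/testvm/testvm.xml" migration_transport=ssh \
    meta allow-migrate=true target-role=Started \
    op monitor timeout=30 interval=30 \
    op stop timeout=120 \
    op start timeout=120
```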
> Oh, one last question: Can I disable fencing for a specific resource for
> testing reasons? I'd love to watch this break without fear of fencing.
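Fencing can't be disabled per resource as such, but for testing you can take the resource out of cluster control so a failed stop does not escalate to fencing. A sketch using crmsh (verify the exact syntax against your crmsh version; note the cluster will not recover the resource while it is in maintenance):

```shell
# Sketch: put the single resource into maintenance mode during the test,
# so the cluster observes but does not react to its failures.
crm resource maintenance vm-testvm2 on
# ... run the breakage test, watch the logs ...
crm resource maintenance vm-testvm2 off  # hand control back to the cluster
```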