[ClusterLabs] VirtualDomain craziness

Wed Apr 28 10:15:57 EDT 2021

On Wed, 2021-04-28 at 13:40 +0200, Ulrich Windl wrote:
> Hi!
> 
> I just discovered a problem after re-locating the configuration file
> of a running VirtualDomain:
> Cluster wanted to restart VM v14 on h18 due to configuration change
> (which is correct).
> Stop went OK, start failed with error "already running". !!??
> Still the cluster insisted on "recovering" v14 on h18, but stop
> "succeeded" with
> INFO: Configuration file /etc/libvirt/libxl/v14.xml not readable,
> resource considered stopped.
> (the path was the old path, so that part was OK again)
> 
> Then the cluster moved v14 away to h16 (and later from h16 to h19),
> all successful. (which is OK)
> Still the cluster continued complaining on h18, despite
> "fail-count=1000000":
> Apr 21 15:59:20 h18 pacemaker-schedulerd[7031]:  warning: Forcing
> prm_xen_v14 away from h18 after 1000000 failures (max=1000000)
> 
> But v14 wasn't running on h18 at that time.

Once the threshold is reached, the scheduler will log that message on
every transition (until the failure is cleared).

Past events affect current decisions, so each transition will log any
essential information about why a resource is where it is, regardless
of when that event happened.
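
To illustrate, here is a toy model of that behaviour in Python. It is
only a sketch, not Pacemaker's actual scheduler code: the function
run_transition and the fail_counts dictionary are hypothetical names
made up for this example, and the only values taken from the report are
the resource name, the node name and the logged message. The point is
that the warning depends on the recorded fail count, not on whether the
resource is currently running on that node.

```python
# Toy model only: names and structure are assumptions for illustration,
# not Pacemaker internals.
INFINITY = 1_000_000  # the value shown as fail-count and max in the log


def run_transition(fail_counts, threshold=INFINITY):
    """Return the warnings one simulated scheduler run would log."""
    warnings = []
    for (resource, node), count in sorted(fail_counts.items()):
        # The check looks at recorded failures, not at where the
        # resource is running right now.
        if count >= threshold:
            warnings.append(
                f"Forcing {resource} away from {node} "
                f"after {count} failures (max={threshold})"
            )
    return warnings


# v14 failed on h18 and has since moved elsewhere; the failure is still recorded.
fail_counts = {("prm_xen_v14", "h18"): INFINITY}

first = run_transition(fail_counts)    # warning is logged
second = run_transition(fail_counts)   # same warning again, next transition

# Clearing the failure removes the record, so the warning stops.
fail_counts.pop(("prm_xen_v14", "h18"))
after_cleanup = run_transition(fail_counts)
```

In a real cluster, the equivalent of the pop() above would be cleaning
up the resource's failure on that node, for example with something like
"crm_resource --cleanup --resource prm_xen_v14 --node h18".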

> The messages continue up to now...
> 
> That's absolutely not OK.
> 
> Seen in SLES15 SP2 with pacemaker-2.0.4+20200616.2deceaa3a-
> 3.3.1.21516.1.PTF.1182607.x86_64
> resource-agents-4.4.0+git57.70549516-3.12.1.x86_64
> 
> Regards,
> Ulrich
> 
-- 
Ken Gaillot <kgaillot at redhat.com>