[ClusterLabs] Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

Tue Jun 26 16:22:08 UTC 2018

On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
> 26.06.2018 09:14, Ulrich Windl wrote:
> > Hi!
> > 
> > We just observed some strange effect we cannot explain in SLES 11
> > SP4 (pacemaker 1.1.12-f47ea56):
> > We run about a dozen Xen PVMs on a three-node cluster (plus some
> > infrastructure and monitoring stuff). It has all worked well so far,
> > and there was no significant change recently.
> > However, when a colleague stopped one VM for maintenance via a
> > cluster command, the cluster did not notice when the PVM was actually
> > running again (it had been started outside the cluster, which is a
> > bad idea, I know).
> 
> To be on the safe side in such cases, you'd probably want to enable an
> additional monitor for the "Stopped" role. The default one covers only
> the "Started" role. It is the same as for multistate resources, where
> you need several monitor ops, for the "Started/Slave" and "Master"
> roles. But this will increase the load.
> Also, I believe the cluster should reprobe a resource on all nodes
> once you change target-role back to "Started".
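
As a rough illustration of the extra monitor suggested above, the sketch
below embeds a hypothetical CIB fragment for a VM resource and lists which
roles its monitor operations cover. The resource ID "vm-example", the op
IDs, the ocf:heartbeat:Xen agent and the 610s interval are assumptions for
illustration, not taken from the poster's configuration. A monitor op
without a role attribute is treated here as covering the "Started" role, as
described above, and the two monitors use different intervals because
Pacemaker requires each monitor on a resource to have its own interval.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB fragment: all IDs, the agent and the 610s interval are
# illustrative assumptions, not the poster's actual configuration.
PRIMITIVE_XML = """
<primitive id="vm-example" class="ocf" provider="heartbeat" type="Xen">
  <operations>
    <op id="vm-example-monitor-600" name="monitor" interval="600s"/>
    <op id="vm-example-monitor-610" name="monitor" interval="610s" role="Stopped"/>
  </operations>
</primitive>
"""


def monitored_roles(primitive_xml):
    """Map each monitored role to its interval.

    An op with no role attribute is counted as covering "Started".
    """
    root = ET.fromstring(primitive_xml)
    roles = {}
    for op in root.iter("op"):
        if op.get("name") == "monitor":
            roles[op.get("role", "Started")] = op.get("interval")
    return roles


roles = monitored_roles(PRIMITIVE_XML)
print(roles)

# Each monitor op on the same resource needs a distinct interval.
assert len(set(roles.values())) == len(roles)
```

If "Stopped" is missing from the result, the cluster has no recurring check
that would notice the VM running somewhere it is supposed to be stopped.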

The question is how you stopped the VM in the first place.

If you stopped it by setting target-role to Stopped, likely the cluster
still thinks it's stopped, and you need to set it to Started again. If
instead you set maintenance mode or unmanaged the resource, then
stopped the VM manually, then most likely it's still in that mode and
needs to be taken out of it.
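
The checks above can be sketched as a small script that looks through a
CIB dump for the three settings mentioned: target-role, is-managed and
maintenance-mode. The XML below is a made-up excerpt (the resource ID
"vm-example" and all nvpair IDs are hypothetical), and the function only
inspects meta attributes set directly on the primitive, so treat it as a
simplified illustration rather than a complete diagnostic.

```python
import xml.etree.ElementTree as ET

# Hypothetical CIB excerpt: IDs and values are illustrative only.
CIB_XML = """
<cib>
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="opt-maintenance" name="maintenance-mode" value="false"/>
      </cluster_property_set>
    </crm_config>
    <resources>
      <primitive id="vm-example" class="ocf" provider="heartbeat" type="Xen">
        <meta_attributes id="vm-example-meta">
          <nvpair id="vm-example-target-role" name="target-role" value="Stopped"/>
        </meta_attributes>
      </primitive>
    </resources>
  </configuration>
</cib>
"""

# Simplified boolean spellings; an assumption made for this sketch.
TRUE_VALUES = {"true", "yes", "on", "1"}
FALSE_VALUES = {"false", "no", "off", "0"}


def blocking_settings(cib_xml, resource_id):
    """Return the reasons why the cluster may be leaving a resource alone."""
    root = ET.fromstring(cib_xml)
    reasons = []

    # Cluster-wide maintenance mode.
    for prop_set in root.iter("cluster_property_set"):
        for nv in prop_set.iter("nvpair"):
            if (nv.get("name") == "maintenance-mode"
                    and nv.get("value", "").lower() in TRUE_VALUES):
                reasons.append("maintenance-mode is enabled")

    # Meta attributes set directly on the resource itself.
    for prim in root.iter("primitive"):
        if prim.get("id") != resource_id:
            continue
        for nv in prim.findall("./meta_attributes/nvpair"):
            name, value = nv.get("name"), nv.get("value", "").lower()
            if name == "target-role" and value == "stopped":
                reasons.append("target-role is Stopped")
            elif name == "is-managed" and value in FALSE_VALUES:
                reasons.append("resource is unmanaged")
    return reasons


print(blocking_settings(CIB_XML, "vm-example"))
```

On a live cluster the XML would come from the running CIB, for example
the output of "cibadmin --query", instead of the embedded string.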

> 
> > Examining the logs, it seems that the recheck timer popped
> > periodically, but no monitor action was run for the VM (the action
> > is configured to run every 10 minutes).
> > 
> > Actually the only monitor operations found were:
> > May 23 08:04:13
> > Jun 13 08:13:03
> > Jun 25 09:29:04
> > Then a manual "reprobe" was done, and several monitor operations
> > were run.
> > Then again I see no more monitor actions in syslog.
> > 
> > What could be the reasons for this? Too many operations defined?
> > 
> > The other message I don't understand is like "<other-resource>:
> > Rolling back scores from <vm-resource>"
> > 
> > Could it be a new bug introduced in pacemaker, or could it be some
> > configuration problem (the status is completely clean, however)?
> > 
> > According to the package changelog, there was no change since Nov
> > 2016...
> > 
> > Regards,
> > Ulrich
> > 
> > 
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> > 
-- 
Ken Gaillot <kgaillot at redhat.com>