[ClusterLabs] Antw: Re: Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

Ken Gaillot kgaillot at redhat.com
Thu Jun 28 13:49:03 EDT 2018


On Thu, 2018-06-28 at 09:13 +0200, Ulrich Windl wrote:
> > > > Ken Gaillot <kgaillot at redhat.com> wrote on 27.06.2018 at
> > > > 16:32 in message
> 
> <1530109926.6452.3.camel at redhat.com>:
> > On Wed, 2018-06-27 at 09:18 -0500, Ken Gaillot wrote:
> > > On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
> > > > > > > Ken Gaillot <kgaillot at redhat.com> wrote on 26.06.2018
> > > > > > > at 18:22 in message
> > > > 
> > > > <1530030128.5202.5.camel at redhat.com>:
> > > > > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
> > > > > > 26.06.2018 09:14, Ulrich Windl wrote:
> > > > > > > Hi!
> > > > > > > 
> > > > > > > We just observed a strange effect we cannot explain in
> > > > > > > SLES 11 SP4 (pacemaker 1.1.12-f47ea56):
> > > > > > > We run about a dozen Xen PVMs on a three-node cluster
> > > > > > > (plus some infrastructure and monitoring stuff). It has
> > > > > > > all worked well so far, and there was no significant
> > > > > > > change recently. However, when a colleague stopped one VM
> > > > > > > for maintenance via a cluster command, the cluster did
> > > > > > > not notice when the PVM was actually running again (it
> > > > > > > had been started without using the cluster (a bad idea,
> > > > > > > I know)).
> > > > > > 
> > > > > > To be on the safe side in such cases, you'd probably want
> > > > > > to enable an additional monitor for the "Stopped" role; the
> > > > > > default one covers only the "Started" role. It's the same
> > > > > > as for multistate resources, where you need several monitor
> > > > > > ops for the "Started"/"Slave" and "Master" roles. But this
> > > > > > will increase the load.
> > > > > > Also, I believe the cluster should reprobe a resource on
> > > > > > all nodes once you change target-role back to "Started".
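> > > > > > A rough sketch of that in crm shell syntax (resource name,
> > > > > > file path and intervals are only illustrative):
> > > > > > 
> > > > > >   primitive vm1 ocf:heartbeat:Xen \
> > > > > >     params xmfile="/etc/xen/vm1" \
> > > > > >     op monitor interval="10min" role="Started" \
> > > > > >     op monitor interval="11min" role="Stopped"
> > > > > > 
> > > > > > (The two monitors must use different intervals so the
> > > > > > cluster can tell them apart.)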
> > > > > 
> > > > > Which raises the question, how did you stop the VM initially?
> > > > 
> > > > I thought "(...) stopped one VM for maintenance via cluster
> > > > command"
> > > > is obvious. It was something like "crm resource stop ...".
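> > > > (For reference, "crm resource stop <resource-id>" simply sets
> > > > the meta attribute target-role=Stopped, and "crm resource
> > > > start" sets it back to Started.)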
> > > > 
> > > > > 
> > > > > If you stopped it by setting target-role to Stopped, the
> > > > > cluster most likely still thinks it's stopped, and you need
> > > > > to set it to Started again. If instead you set maintenance
> > > > > mode or unmanaged the resource and then stopped the VM
> > > > > manually, the resource is most likely still in that mode and
> > > > > needs to be taken out of it.
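> > > > > For example, depending on which of those applies (resource
> > > > > id illustrative):
> > > > > 
> > > > >   crm resource start vm1
> > > > >   crm resource manage vm1
> > > > >   crm configure property maintenance-mode=false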
> > > > 
> > > > The point was that when the command to start the resource was
> > > > given, the cluster completely ignored the fact that it was
> > > > already running and began starting the VM on a second node
> > > > (which could be disastrous). But that's leading away from the
> > > > main question...
> > > 
> > > Ah, this is expected behavior when you start a resource manually
> > > and there are no monitors with role="Stopped". If the node where
> > > you manually started the VM isn't the same node the cluster
> > > happens to choose, then you can get multiple active instances.
> > > 
> > > By default, the cluster assumes that where a probe found a
> > > resource to be not running, that resource will stay not running
> > > unless started by the cluster. (It will re-probe if the node
> > > goes away and comes back.)
> > > 
> > > If you wish to guard against resources being started outside
> > > cluster control, configure a recurring monitor with
> > > role="Stopped", and the cluster will run that on all nodes where
> > > it thinks the resource is not supposed to be running. Of course,
> > > since it has to poll at intervals, it can take up to that much
> > > time to detect a manually started instance.
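> > > (To see where the cluster currently believes a resource is
> > > active, something like "crm_resource -r <resource-id> --locate"
> > > can be used.)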
> > 
> > Alternatively, if you don't want the overhead of a recurring
> > monitor but want to be able to address known manual starts
> > yourself, you can force a full reprobe of the resource with
> > "crm_resource -r <resource-id> --refresh".
> > 
> > If you do it before starting the resource via crm, the cluster
> > will stop the manually started instance, and then you can start
> > it via crm; if you do it after starting the resource via crm,
> > there will still likely be two active instances, and the cluster
> > will stop both and start one again.
> > 
> > A way around that would be to unmanage the resource, start the
> > resource via crm (which won't actually start anything due to being
> > unmanaged, but will tell the cluster it's supposed to be started),
> > force a reprobe, then manage the resource again -- that should
> > prevent multiple active instances. However, if the cluster prefers
> > a different node, it may still stop the resource and start it in
> > its preferred location. (Stickiness could get around that.)
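> > Roughly, that sequence would look like this (resource id
> > illustrative):
> > 
> >   crm resource unmanage vm1
> >   crm resource start vm1         # only records the intent to start
> >   crm_resource -r vm1 --refresh  # reprobe; cluster sees it running
> >   crm resource manage vm1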
> 
> Hi!
> 
> Thanks again for that. There's one question that comes to my mind:
> what is the purpose of the cluster recheck interval? I thought it
> was exactly that: finding resources that are not in the state they
> should be in.

I can see how the name would suggest that, but nope, it's just a
recalculation of whether any actions need to be taken.

It comes in handy for two purposes. First, rules and some options
(such as failure-timeout) that depend on time values are not
guaranteed to be evaluated more often than the recheck interval. So
if you have a rule setting maintenance mode between 10:30pm and 11pm
and the recheck interval is 15 minutes, maintenance mode could be
entered anytime between 10:30 and 10:45, and exited anytime between
11:00 and 11:15.

Second, it's a fail-safe for bugs that cause the cluster to miss an
event. If the cluster fails to react to a recorded event, it should
notice when the next recheck interval expires.
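That interval is controlled by the cluster property
cluster-recheck-interval, e.g. (value illustrative):

  crm configure property cluster-recheck-interval="15min"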

> 
> Regards,
> Ulrich
> 
> 
> > 
> > > 
> > > > > > > Examining the logs, it seems that the recheck timer
> > > > > > > popped periodically, but no monitor action was run for
> > > > > > > the VM (the action is configured to run every 10
> > > > > > > minutes).
> > > 
> > > Recurring monitors are recorded in the log only when their
> > > return value changes. If there are 10 successful monitors in a
> > > row and then a failure, only the first success and the failure
> > > are logged.
> > > 
> > > > > > > 
> > > > > > > Actually the only monitor operations found were:
> > > > > > > May 23 08:04:13
> > > > > > > Jun 13 08:13:03
> > > > > > > Jun 25 09:29:04
> > > > > > > Then a manual "reprobe" was done, and several monitor
> > > > > > > operations
> > > > > > > were run.
> > > > > > > Then again I see no more monitor actions in syslog.
> > > > > > > 
> > > > > > > What could be the reasons for this? Too many operations
> > > > > > > defined?
> > > > > > > 
> > > > > > > The other message I don't understand is something like
> > > > > > > "<other-resource>: Rolling back scores from
> > > > > > > <vm-resource>".
> > > > > > > 
> > > > > > > Could it be a new bug introduced in pacemaker, or could
> > > > > > > it be some configuration problem (the status is
> > > > > > > completely clean, however)?
> > > > > > > 
> > > > > > > According to the package changelog, there has been no
> > > > > > > change since Nov 2016...
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Ulrich
> > 
> > -- 
> > Ken Gaillot <kgaillot at redhat.com>
> 
> 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


