[ClusterLabs] Antw: Re: Antw: Re: Resources not monitored in SLES11 SP4 (1.1.12-f47ea56)

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Thu Jun 28 07:13:32 UTC 2018


>>> Ken Gaillot <kgaillot at redhat.com> wrote on 27.06.2018 at 16:32 in message
<1530109926.6452.3.camel at redhat.com>:
> On Wed, 2018-06-27 at 09:18 -0500, Ken Gaillot wrote:
>> On Wed, 2018-06-27 at 07:41 +0200, Ulrich Windl wrote:
>> > > > > Ken Gaillot <kgaillot at redhat.com> wrote on 26.06.2018 at
>> > > > > 18:22 in message
>> > 
>> > <1530030128.5202.5.camel at redhat.com>:
>> > > On Tue, 2018-06-26 at 10:45 +0300, Vladislav Bogdanov wrote:
>> > > > 26.06.2018 09:14, Ulrich Windl wrote:
>> > > > > Hi!
>> > > > > 
>> > > > > We just observed a strange effect we cannot explain in
>> > > > > SLES 11 SP4 (pacemaker 1.1.12-f47ea56):
>> > > > > We run about a dozen Xen PVMs on a three-node cluster (plus
>> > > > > some infrastructure and monitoring stuff). It has all worked
>> > > > > well so far, and there was no significant change recently.
>> > > > > However, when a colleague stopped one VM for maintenance via
>> > > > > cluster command, the cluster did not notice when the PVM was
>> > > > > actually running again (it had been started outside the
>> > > > > cluster (a bad idea, I know)).
>> > > > 
>> > > > To be on the safe side in such cases you'd probably want to
>> > > > enable an additional monitor for the "Stopped" role. The
>> > > > default one covers only the "Started" role. It is the same as
>> > > > for multistate resources, where you need several monitor ops,
>> > > > for the "Started/Slave" and "Master" roles.
>> > > > But this will increase the load.
>> > > > And I believe the cluster should reprobe a resource on all
>> > > > nodes once you change target-role back to "Started".
>> > > 
>> > > Which raises the question, how did you stop the VM initially?
>> > 
>> > I thought "(...) stopped one VM for maintenance via cluster
>> > command"
>> > is obvious. It was something like "crm resource stop ...".
>> > 
>> > > 
>> > > If you stopped it by setting target-role to Stopped, likely the
>> > > cluster
>> > > still thinks it's stopped, and you need to set it to Started
>> > > again.
>> > > If
>> > > instead you set maintenance mode or unmanaged the resource, then
>> > > stopped the VM manually, then most likely it's still in that mode
>> > > and
>> > > needs to be taken out of it.
>> > 
>> > The point was that when the command to start the resource was
>> > given, the cluster completely ignored the fact that it was already
>> > running and began starting the VM on a second node (which could be
>> > disastrous). But that's leading away from the main question...
>> 
>> Ah, this is expected behavior when you start a resource manually, and
>> there are no monitors with role=Stopped. If the node where you
>> manually started the VM isn't the same node the cluster happens to
>> choose, then you can get multiple active instances.
>> 
>> By default, the cluster assumes that where a probe found a resource to
>> be not running, that resource will stay not running unless started by
>> the cluster. (It will re-probe if the node goes away and comes back.)
>> 
>> If you wish to guard against resources being started outside cluster
>> control, configure a recurring monitor with role=Stopped, and the
>> cluster will run that on all nodes where it thinks the resource is
>> not supposed to be running. Of course, since it has to poll at
>> intervals, it can take up to that much time to detect a manually
>> started instance.
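
For illustration, such a monitor could look roughly like this in crm
shell syntax (a minimal sketch; the resource name "vm01", the Xen
config path, and the intervals are placeholders, and the two monitor
ops need distinct intervals):

    primitive vm01 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/vm01" \
        op monitor interval="600s" timeout="60s" \
        op monitor interval="610s" timeout="60s" role="Stopped"

The role="Stopped" monitor is what lets the cluster notice an instance
started behind its back on a node where it expects the resource to be
stopped.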
> 
> Alternatively, if you don't want the overhead of a recurring monitor
> but want to be able to address known manual starts yourself, you can
> force a full reprobe of the resource with "crm_resource -r <resource-
> id> --refresh".
> 
> If you do it before starting the resource via crm, the cluster will
> stop the manually started instance, and then you can start it via the
> crm; if you do it after starting the resource via crm, there will still
> likely be two active instances, and the cluster will stop both and
> start one again.
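
As a concrete sketch of the reprobe-first variant (assuming a
hypothetical resource id "vm01"; since target-role is still Stopped at
that point, the probe will detect the manually started instance and the
cluster will stop it):

    crm_resource -r vm01 --refresh   # re-probe vm01 on all nodes
    crm resource start vm01          # then let the cluster start it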
> 
> A way around that would be to unmanage the resource, start the resource
> via crm (which won't actually start anything due to being unmanaged,
> but will tell the cluster it's supposed to be started), force a
> reprobe, then manage the resource again -- that should prevent multiple
> active instances. However, if the cluster prefers a different node, it
> may still stop the resource and start it in its preferred location.
> (Stickiness could get around that.)
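
That unmanage/reprobe sequence might look like this, again with a
hypothetical resource id "vm01":

    crm resource unmanage vm01       # cluster stops acting on the resource
    crm resource start vm01          # records the intent; nothing is started while unmanaged
    crm_resource -r vm01 --refresh   # re-probe so the cluster learns where it already runs
    crm resource manage vm01         # hand control back to the cluster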

Hi!

Thanks again for that. One question comes to my mind: what is the purpose of the cluster recheck interval? I thought it was exactly that: finding resources that are not in the state they should be.

Regards,
Ulrich


> 
>> 
>> > > > > Examining the logs, it seems that the recheck timer popped
>> > > > > periodically, but no monitor action was run for the VM (the
>> > > > > action
>> > > > > is configured to run every 10 minutes).
>> 
>> Recurring monitors are only recorded in the log if their return value
>> changed. If there are 10 successful monitors in a row and then a
>> failure, only the first success and the failure are logged.
>> 
>> > > > > 
>> > > > > Actually the only monitor operations found were:
>> > > > > May 23 08:04:13
>> > > > > Jun 13 08:13:03
>> > > > > Jun 25 09:29:04
>> > > > > Then a manual "reprobe" was done, and several monitor
>> > > > > operations were run.
>> > > > > After that, I again see no monitor actions in syslog.
>> > > > > 
>> > > > > What could be the reasons for this? Too many operations
>> > > > > defined?
>> > > > > 
>> > > > > The other message I don't understand is one like
>> > > > > "<other-resource>: Rolling back scores from <vm-resource>"
>> > > > > 
>> > > > > Could it be a new bug introduced in pacemaker, or could it be
>> > > > > some configuration problem (the status is completely clean,
>> > > > > however)?
>> > > > > 
>> > > > > According to the package changelog, there was no change since
>> > > > > Nov 2016...
>> > > > > 
>> > > > > Regards,
>> > > > > Ulrich
> -- 
> Ken Gaillot <kgaillot at redhat.com>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 




