[ClusterLabs] Debugging problems with resource timeout without any actions from cluster

Tue Oct 17 05:30:13 EDT 2017

Ken Gaillot ☫ → To Cluster Labs - All topics related to open-source clustering welcomed @ Thu, Oct 12, 2017 09:47 -0500

Thanks for the answer, Ken,

> > I found several ways to achieve that:
> > 
> > 1. Put the cluster in maintenance mode (as described here:
> >    https://www.hastexo.com/resources/hints-and-kinks/maintenance-active-pacemaker-clusters/)
> > 
> >    As far as I understand, services will be monitored, all logs
> >    written, etc., but no action in case of failures will be taken.
> >    Is that right?
> 
> Actually, maintenance mode stops all monitors (except those with
> role=Stopped, which ensure a service is not running).

OK, got it.

> > 2. Put the particular resource in unmanaged mode, as described here:
> >    http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/#s-monitoring-unmanaged
> 
> Disabling starts and stops is the exact purpose of unmanaged, so this
> is one way to get what you want. FYI you can also set this as a global
> default for all resources by setting it in the resource defaults
> section of the configuration.

OK, got it too.
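
For reference, a rough sketch of what such a resource-defaults entry might look like in the CIB, generated here with Python's standard xml.etree module. The ids (`rsc-options` and the nvpair id derived from it) are placeholders chosen for illustration; only the `is-managed` meta-attribute itself comes from the discussion above.

```python
import xml.etree.ElementTree as ET


def unmanaged_defaults(set_id="rsc-options"):
    """Build an <rsc_defaults> element that sets is-managed=false.

    set_id is a hypothetical id for the meta_attributes block.
    """
    defaults = ET.Element("rsc_defaults")
    attrs = ET.SubElement(defaults, "meta_attributes", {"id": set_id})
    ET.SubElement(
        attrs,
        "nvpair",
        {
            "id": f"{set_id}-is-managed",  # placeholder id
            "name": "is-managed",
            "value": "false",
        },
    )
    return defaults


# Print the fragment so it can be compared with the live configuration.
print(ET.tostring(unmanaged_defaults(), encoding="unicode"))
```

This only builds the XML fragment as text; it does not touch a running cluster.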

> > 3. Start all resources and remove start and stop operations from them.
> 
> :-O

It's a rather quirky way, but it exists! :-)

> > Which is the best way to achieve my purpose? I would like the cluster
> > to run as usual (logging as usual, or with trace on the problematic
> > resource), but no action should be taken in case of monitor failure.
> 
> That's actually a different goal, also easily accomplished, by setting
> on-fail=ignore on the monitor operation. From the sound of it, this is
> closer to what you want, since the cluster is still allowed to
> start/stop resources when you standby a node, etc.

I'll try this one.
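
A minimal sketch of the monitor operation I have in mind, built as a CIB `<op>` element with Python's standard library. The resource name `my-database` and the 30s interval are assumptions made up for the example; the 20s timeout and `on-fail=ignore` are the values discussed in this thread.

```python
import xml.etree.ElementTree as ET


def monitor_op(resource_id, interval="30s", timeout="20s", on_fail="ignore"):
    """Build an <op> element for a recurring monitor with the given on-fail policy."""
    return ET.Element(
        "op",
        {
            "id": f"{resource_id}-monitor-{interval}",  # placeholder id scheme
            "name": "monitor",
            "interval": interval,
            "timeout": timeout,
            "on-fail": on_fail,
        },
    )


# "my-database" is a hypothetical resource id.
op = monitor_op("my-database")
print(ET.tostring(op, encoding="unicode"))
```

With this policy the failure should still show up in cluster status, but no recovery would be attempted for it.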

> You could also delete the recurring monitor operation from the
> configuration, and it wouldn't run at all. But keeping it and setting
> on-fail=ignore lets you see failures in cluster status.

> However, I'm not sure bypassing the monitor is the best solution to
> this problem. If the problem is simply that your database monitor can
> legitimately take longer than 20 seconds in normal operation, then
> raise the timeout as needed.

I want to determine why it needed more than 20 seconds, and under what
circumstances.
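
One way to collect that data outside the cluster might be a small timing wrapper like the sketch below. It runs an arbitrary command, measures wall-clock time, and flags runs slower than the 20-second monitor timeout. The command to time is left to the caller; the demo call at the end just runs the Python interpreter as a stand-in, not a real monitor.

```python
import subprocess
import sys
import time


def time_command(cmd, warn_after=20.0):
    """Run cmd (a list of arguments) and report how long it took.

    warn_after is the threshold in seconds; 20.0 mirrors the monitor
    timeout mentioned in this thread.
    """
    start = time.monotonic()
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    elapsed = time.monotonic() - start
    return {
        "returncode": proc.returncode,
        "elapsed": elapsed,
        "slow": elapsed > warn_after,
    }


# Stand-in command for demonstration only; replace with the real check.
result = time_command([sys.executable, "-c", "pass"])
print(f"rc={result['returncode']} elapsed={result['elapsed']:.3f}s slow={result['slow']}")
```

Run periodically and logged alongside system load, this could help correlate slow runs with whatever else is happening on the node.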

-- 
Bright regards, Sergey Korobitsin,
Chief Research Officer
Arta Software, http://arta.kz/
xmpp:undertaker at jabber.arta.kz

not to resist this tendency; an idea as the most decisive leap forward,
and idleness as the most creative of all actions.
  -- Tristan Tzara