[ClusterLabs] big trouble with a DRBD resource

Mon Aug 7 16:43:55 EDT 2017

On Mon, 2017-08-07 at 12:54 +0200, Lentes, Bernd wrote:
> ----- On Aug 4, 2017, at 10:19 PM, Ken Gaillot kgaillot at redhat.com wrote:
> 
> > Unfortunately no -- logging, and troubleshooting in general, is an area
> > we are continually striving to improve, but there are more to-do's than
> > time to do them.
> 
> Sad but understandable. Is it worth trying to understand the logs, or should I focus on
> hb_report or crm history? I played around a bit with hb_report, but it seems to just collect information that is already available and does not make it any easier to read.

The logs are very useful, but not particularly easy to follow. It takes
some practice and experience, but I think it's worth it if you have to
troubleshoot cluster events often.

It's on the to-do list to create a "Troubleshooting Pacemaker" document
that helps with this and with using tools such as crm_simulate.

The first step in understanding the logs is to learn what the pacemaker
daemons are and what they do, and what the DC node is. It starts to make
more sense from there:

   pacemakerd: spawns all other daemons and re-spawns them if they crash
   attrd: manages node attributes
   cib: manages reading/writing the configuration
   lrmd: executes resource agents
   pengine: given a cluster state, determines any actions needed
   crmd: manages cluster membership and carries out the pengine's
      decisions by asking the lrmd to perform actions

At any given time, one node's crmd in the cluster (or partition if there
is a network split) is elected as the DC (designated controller). The DC
asks the pengine what needs to be done, then farms out the results to
all the other crmd's, which (if necessary) call their local lrmd to
actually execute the actions.
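
As a rough aid for reading the logs this way, here is a minimal Python
sketch that groups syslog lines by the daemon that wrote them. The line
layout is assumed from the excerpts quoted in this thread, and
group_by_daemon() is a hypothetical helper, not part of Pacemaker.
Wrapped log lines need to be re-joined before parsing.

```python
import re
from collections import defaultdict

# Assumed layout, based only on the excerpts quoted in this thread:
#   "Aug  1 14:19:33 ha-idg-2 pengine[27043]:  warning: unpack_rsc_op_failure: ..."
LOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+"
    r"(?P<node>\S+)\s+"
    r"(?P<daemon>[\w-]+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<severity>\w+):\s+"
    r"(?P<message>.*)$"
)

# Daemon roles as summarized in the list above.
DAEMON_ROLES = {
    "pacemakerd": "spawns all other daemons and re-spawns them if they crash",
    "attrd": "manages node attributes",
    "cib": "manages reading/writing the configuration",
    "lrmd": "executes resource agents",
    "pengine": "given a cluster state, determines any actions needed",
    "crmd": "manages membership and carries out the pengine's decisions",
}


def group_by_daemon(lines):
    """Return {daemon: [parsed fields, ...]} for lines from known daemons."""
    grouped = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("daemon") in DAEMON_ROLES:
            grouped[match.group("daemon")].append(match.groupdict())
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        "Aug  1 14:19:33 ha-idg-1 attrd[4690]:   notice: attrd_perform_update: "
        "Sent update 8: fail-count-prim_drbd_idcc_devel=INFINITY",
    ]
    for daemon, entries in group_by_daemon(sample).items():
        print(daemon, "-", DAEMON_ROLES[daemon])
        for entry in entries:
            print("   ", entry["node"], entry["severity"], entry["message"])
```

Reading the pengine lines from the DC first, then the crmd and lrmd
lines on the node that ran the action, usually follows the order in
which things actually happened.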

> > The "ERROR" message is coming from the DRBD resource agent itself, not
> > pacemaker. Between that message and the two separate monitor operations,
> > it looks like the agent will only run as a master/slave clone.
> 
> Yes. I see it in the RA.
> 
> >> And why does it complain that stop is not configured ?
> > 
> > A confusing error message. It's not complaining that the operations are
> > not configured, it's saying the operations failed because the resource
> > is not properly configured. What "properly configured" means is up to
> > the individual resource agent.
> 
> Aah. And why does it not complain about a "failed" start op?
> Is it because I have "target-role=stopped" in rsc_defaults? So it does not try to start the resource, but tries to stop it initially?

target-role=Stopped will indeed prevent it from trying to start, which
explains why there's no message for that. It shouldn't try a stop though
unless one is needed, so I'm not sure offhand why the stop was
initiated.

> 
> >> The DC says:
> >> Aug  1 14:19:33 ha-idg-2 pengine[27043]:  warning: unpack_rsc_op_failure:
> >> Processing failed op stop for prim_drbd_idcc_devel on ha-idg-1: not configured
> >> (6)
> >> Aug  1 14:19:33 ha-idg-2 pengine[27043]:    error: unpack_rsc_op: Preventing
> >> prim_drbd_idcc_devel from re-starting anywhere: operation stop failed 'not
> >> configured' (6)
> >> 
> >> Again it complains about a failed stop, saying it's not configured. Or does it
> >> complain that the failure of a stop op is not configured?
> > 
> > Again, it's confusing, but you have various logs of the same event
> > coming from three different places.
> > 
> > First, DRBD logged that there is a "meta parameter misconfigured". It
> > then reported that error value back to the crmd cluster daemon that
> > called it, so the crmd logged the error as well, that the result of the
> > operation was "not configured".
> > 
> > Then (above), when the policy engine reads the current status of the
> > cluster, it sees that there is a failed operation, so it decides what to
> > do about the failure.
> 
> Ok.
>  
> >> The doc says:
> >> "Some operations are generated by the cluster itself, for example, stopping and
> >> starting resources as needed."
> >> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html
> >> Is the doc wrong?
> >> What happens when I DON'T configure start/stop operations? Are they configured
> >> automatically?
> >> I have several primitives without a configured start/stop operation, but I have
> >> never had any problems with them.
> > 
> > Start and stop are indeed created by the cluster itself. If there are
> > start and stop operations configured in the cluster configuration, those
> > are used solely to get the meta-attributes such as timeout, to override
> > the defaults.
> 
> Ok.
> 
> 
> >> The fail count goes straight to INFINITY:
> >> Aug  1 14:19:33 ha-idg-1 attrd[4690]:   notice: attrd_trigger_update: Sending
> >> flush op to all hosts for: fail-count-prim_drbd_idcc_devel (INFINITY)
> >> Aug  1 14:19:33 ha-idg-1 attrd[4690]:   notice: attrd_perform_update: Sent
> >> update 8: fail-count-prim_drbd_idcc_devel=INFINITY
> > 
> > Yes, a few result codes are considered "fatal", or automatically
> > INFINITY failures. The idea is that if the resource is misconfigured,
> > that's not going to change by simply re-running the agent.
> 
> That makes sense.
> 
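For reference, the "(6)" at the end of the pengine messages quoted
earlier is the return code of the resource agent. A minimal Python
sketch for decoding it, assuming the standard OCF return code names;
decode_return_code() is a hypothetical helper, and the table says
nothing about which codes a given Pacemaker version treats as fatal:

```python
import re

# Standard OCF resource agent return codes.
OCF_RETURN_CODES = {
    0: "OCF_SUCCESS",
    1: "OCF_ERR_GENERIC",
    2: "OCF_ERR_ARGS",
    3: "OCF_ERR_UNIMPLEMENTED",
    4: "OCF_ERR_PERM",
    5: "OCF_ERR_INSTALLED",
    6: "OCF_ERR_CONFIGURED",
    7: "OCF_NOT_RUNNING",
    8: "OCF_RUNNING_MASTER",
    9: "OCF_FAILED_MASTER",
}

# Assumption: the code appears in parentheses at the very end of the
# message, as in "... operation stop failed 'not configured' (6)".
TRAILING_CODE = re.compile(r"\((\d+)\)\s*$")


def decode_return_code(message):
    """Return (code, OCF name) for a message ending in "(N)", else None."""
    match = TRAILING_CODE.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, OCF_RETURN_CODES.get(code, "unknown")


if __name__ == "__main__":
    msg = ("Preventing prim_drbd_idcc_devel from re-starting anywhere: "
           "operation stop failed 'not configured' (6)")
    print(decode_return_code(msg))
```
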
> > 
> >> After exactly 9 minutes the complaints about the "not configured" stop operation
> >> stopped, but the complaints about the missing clone-max still appear, although both
> >> nodes are in standby.
> > 
> > I'm not sure why your nodes are in standby, but that should be unrelated
> > to all of this, unless perhaps you configured on-fail=standby.
> 
> They are in standby because I put them into this state manually.
> 
> > 
> >> Now the fail-count is 1 million:
> >> Aug  1 14:28:33 ha-idg-1 attrd[4690]:   notice: attrd_trigger_update: Sending
> >> flush op to all hosts for: fail-count-prim_drbd_idcc_devel (1000000)
> >> Aug  1 14:28:33 ha-idg-1 attrd[4690]:   notice: attrd_perform_update: Sent
> >> update 7076: fail-count-prim_drbd_idcc_devel=1000000
> > 
> > Within Pacemaker, INFINITY = 1000000. I'm not sure why it's logged
> > differently here, but it's the same value.
> 
> Ok.
> 
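If you compare fail-count values from the logs in a script, the two
spellings can be normalized first. A minimal Python sketch, assuming
only what is stated above (INFINITY = 1000000); parse_score() is a
hypothetical helper, and clamping out-of-range numbers to +/-INFINITY
is an assumption of this sketch:

```python
INFINITY = 1000000  # value stated in this thread


def parse_score(text):
    """Convert a score such as "INFINITY", "-INFINITY" or "42" to an int."""
    value = text.strip()
    upper = value.upper()
    if upper in ("INFINITY", "+INFINITY"):
        return INFINITY
    if upper == "-INFINITY":
        return -INFINITY
    # Assumption: plain numbers are clamped to the +/-INFINITY range.
    return max(-INFINITY, min(INFINITY, int(value)))


if __name__ == "__main__":
    # The two attrd messages quoted above carry the same value.
    print(parse_score("INFINITY") == parse_score("1000000"))
```
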
> 
> >> A big problem was that I have a ClusterMon resource running on each node. It
> >> triggered about 20000 SNMP traps in 193 seconds to my management station, which
> >> triggered 20000 e-mails ...
> >> Where does this incredible number of traps come from? Nearly all traps said that
> >> stop is not configured for the DRBD resource. Why does it complain so often? And
> >> why does it stop after ~20,000 traps?
> >> And it complained about the not configured monitor operation just 8 times.
> > 
> > I'm not really sure; I haven't used ClusterMon enough to say. If you
> > have Pacemaker 1.1.15 or later, the alerts feature is preferred to
> > ClusterMon.
> 
> I have 1.12.
> Do you have any experience with the SNMP monitoring from sys4, https://github.com/sys4/pacemaker-snmp ?

I'm not familiar with it.

> 
> Bernd

-- 
Ken Gaillot <kgaillot at redhat.com>