[ClusterLabs] big trouble with a DRBD resource
Lentes, Bernd
bernd.lentes at helmholtz-muenchen.de
Mon Aug 7 12:54:58 CEST 2017
----- On Aug 4, 2017, at 10:19 PM, Ken Gaillot kgaillot at redhat.com wrote:
>
> Unfortunately no -- logging, and troubleshooting in general, is an area
> we are continually striving to improve, but there are more to-do's than
> time to do them.
Sad but understandable. Is it worth trying to understand the logs, or should I rather keep an eye on
hb_report or crm history? I played around a bit with hb_report, but it seems it just collects information that is already available and does not simplify the view on it.
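What I tried was roughly the following (the time window and the destination directory are just examples):

    hb_report -f "2017-08-01 14:00" -t "2017-08-01 15:00" /tmp/hb_report_drbd

As far as I can see that gives me a tarball with the logs, the CIB and the PE inputs from both nodes, but I still have to read the raw logs myself.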
> The "ERROR" message is coming from the DRBD resource agent itself, not
> pacemaker. Between that message and the two separate monitor operations,
> it looks like the agent will only run as a master/slave clone.
Yes. I see it in the RA.
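So the primitive only works wrapped in a master/slave resource, i.e. roughly like this (the ms resource name and the drbd_resource parameter are just my guesses, not copied from my config):

    primitive prim_drbd_idcc_devel ocf:linbit:drbd \
            params drbd_resource=idcc_devel \
            op monitor interval=30 role=Slave \
            op monitor interval=20 role=Master
    ms ms_drbd_idcc_devel prim_drbd_idcc_devel \
            meta master-max=1 master-node-max=1 \
            clone-max=2 clone-node-max=1 notify=true

That would also explain the complaints about the missing clone-max, I guess.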
>> And why does it complain that stop is not configured ?
>
> A confusing error message. It's not complaining that the operations are
> not configured, it's saying the operations failed because the resource
> is not properly configured. What "properly configured" means is up to
> the individual resource agent.
Ah. And why does it not complain about a "failed" start op?
Is that because I have "target-role=stopped" in rsc_defaults? So it initially tries to stop, not start, the resource?
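For reference, what I have there is essentially just:

    rsc_defaults target-role=Stopped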
>> The DC says:
>> Aug 1 14:19:33 ha-idg-2 pengine[27043]: warning: unpack_rsc_op_failure:
>> Processing failed op stop for prim_drbd_idcc_devel on ha-idg-1: not configured
>> (6)
>> Aug 1 14:19:33 ha-idg-2 pengine[27043]: error: unpack_rsc_op: Preventing
>> prim_drbd_idcc_devel from re-starting anywhere: operation stop failed 'not
>> configured' (6)
>>
>> Again complaining about a failed stop, saying it's not configured. Or does it
>> complain that the fail of a stop op is not configured ?
>
> Again, it's confusing, but you have various logs of the same event
> coming from three different places.
>
> First, DRBD logged that there is a "meta parameter misconfigured". It
> then reported that error value back to the crmd cluster daemon that
> called it, so the crmd logged the error as well, that the result of the
> operation was "not configured".
>
> Then (above), when the policy engine reads the current status of the
> cluster, it sees that there is a failed operation, so it decides what to
> do about the failure.
Ok.
>> The doc says:
>> "Some operations are generated by the cluster itself, for example, stopping and
>> starting resources as needed."
>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html
>> . Is the doc wrong ?
>> What happens when i DON'T configure start/stop operations ? Are they configured
>> automatically ?
>> I have several primitives without a configured start/stop operation, but never
>> had any problems with them.
>
> Start and stop are indeed created by the cluster itself. If there are
> start and stop operations configured in the cluster configuration, those
> are used solely to get the meta-attributes such as timeout, to override
> the defaults.
Ok.
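If I understand that right, configuring them explicitly is only useful to override the defaults (timeout etc.). The cluster-wide default would come from op_defaults, and an explicit start/stop entry per resource just overrides it, e.g. (prim_example and the timeout values are made up for illustration):

    op_defaults timeout=60
    primitive prim_example ocf:pacemaker:Dummy \
            op start timeout=120 interval=0 \
            op stop timeout=120 interval=0 \
            op monitor interval=30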
>> failcount is direct INFINITY:
>> Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: Sending
>> flush op to all hosts for: fail-count-prim_drbd_idcc_devel (INFINITY)
>> Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: Sent
>> update 8: fail-count-prim_drbd_idcc_devel=INFINITY
>
> Yes, a few result codes are considered "fatal", or automatically
> INFINITY failures. The idea is that if the resource is misconfigured,
> that's not going to change by simply re-running the agent.
That makes sense.
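So after fixing the configuration I assume I have to clear the failcount manually, with something like this (node and resource names taken from my logs):

    # check the current failcount of the resource on one node
    crm resource failcount prim_drbd_idcc_devel show ha-idg-1
    # forget the failures after the configuration is fixed
    crm resource cleanup prim_drbd_idcc_devel

Correct?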
>
>> After exact 9 minutes the complaints about the not configured stop operation
>> stopped, the complaints about missing clone-max still appears, although both
>> nodes are in standby
>
> I'm not sure why your nodes are in standby, but that should be unrelated
> to all of this, unless perhaps you configured on-fail=standby.
They are in standby because I put them manually into that state.
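I.e. with something like:

    crm node standby ha-idg-1
    crm node standby ha-idg-2

and they will go back online later with "crm node online <node>".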
>
>> now fail-count is 1 million:
>> Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: Sending
>> flush op to all hosts for: fail-count-prim_drbd_idcc_devel (1000000)
>> Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: Sent
>> update 7076: fail-count-prim_drbd_idcc_devel=1000000
>
> Within Pacemaker, INFINITY = 1000000. I'm not sure why it's logged
> differently here, but it's the same value.
Ok.
>> A big problem was that i have a ClusterMon resource running on each node. It
>> triggered about 20000 snmp traps in 193 seconds to my management station, which
>> triggered 20000 e-Mails ...
>> From where comes this incredible amount of traps ? Nearly all traps said that
>> stop is not configured for the drdb resource. Why complaining so often ? And
>> why stopping after ~20.000 traps ?
>> And complaining about not configured monitor operation just 8 times.
>
> I'm not really sure; I haven't used ClusterMon enough to say. If you
> have Pacemaker 1.1.15 or later, the alerts feature is preferred to
> ClusterMon.
I have 1.1.12.
Do you have experience with the SNMP monitoring from sys4 (https://github.com/sys4/pacemaker-snmp)?
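Just for context, my ClusterMon setup looks roughly like this (resource names and the SNMP target are placeholders, and --snmp-traps needs a crm_mon built with SNMP support):

    primitive prim_clustermon ocf:pacemaker:ClusterMon \
            params extra_options="--snmp-traps mgmt-station.example.com" \
            op monitor interval=10 timeout=20
    clone clone_clustermon prim_clustermon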
Bernd
Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671