[ClusterLabs] big trouble with a DRBD resource
kgaillot at redhat.com
Fri Aug 4 16:19:47 EDT 2017
On Fri, 2017-08-04 at 18:20 +0200, Lentes, Bernd wrote:
> first: is there a tutorial or something else that helps in understanding what pacemaker logs to syslog and /var/log/cluster/corosync.log ?
> I try hard to find out what's going wrong, but the logs are difficult to understand, also because of the amount of information.
> Or should I deal more with "crm history" or hb_report ?
Unfortunately no -- logging, and troubleshooting in general, is an area
we are continually striving to improve, but there are more to-do's than
time to do them.
> What happened:
> I tried to configure a simple drbd resource following http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#idm140457860751296
> I used this simple snip from the doc:
> configure primitive WebData ocf:linbit:drbd params drbd_resource=wwwdata \
> op monitor interval=60s
> I did it on the live cluster, which is currently in testing. I will never do this again. crm shadow will be my friend.
> The cluster reacted promptly:
> crm(live)# configure primitive prim_drbd_idcc_devel ocf:linbit:drbd params drbd_resource=idcc-devel \
> > op monitor interval=60
> WARNING: prim_drbd_idcc_devel: default timeout 20s for start is smaller than the advised 240
> WARNING: prim_drbd_idcc_devel: default timeout 20s for stop is smaller than the advised 100
> WARNING: prim_drbd_idcc_devel: action monitor not advertised in meta-data, it may not be supported by the RA
> From what I understand so far, I didn't configure start/stop operations, so the cluster chooses the defaults from default-action-timeout.
> It didn't configure the monitor operation, because it is not in the meta-data.
> I checked it:
> crm(live)# ra info ocf:linbit:drbd
> Manages a DRBD device as a Master/Slave resource (ocf:linbit:drbd)
> Operations' defaults (advisory minimum):
> start timeout=240
> promote timeout=90
> demote timeout=90
> notify timeout=90
> stop timeout=100
> monitor_Slave timeout=20 interval=20
> monitor_Master timeout=20 interval=10
> OK. I have to configure monitor_Slave and monitor_Master.
> The log says:
> Aug 1 14:19:33 ha-idg-1 drbd(prim_drbd_idcc_devel): ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset.
> Aug 1 14:19:33 ha-idg-1 crmd: notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_0: not configured (node=ha-idg-1, call=73, rc=6, cib-update=37, confirmed=true)
> Aug 1 14:19:33 ha-idg-1 crmd: notice: process_lrm_event: Operation prim_drbd_idcc_devel_stop_0: not configured (node=ha-idg-1, call=74, rc=6, cib-update=38, confirmed=true)
> Why is it complaining about a missing clone-max ? That is a meta attribute for a clone, not for a simple resource !?! This message is constantly repeated; it still appears although the cluster has been in standby for three days.
The "ERROR" message is coming from the DRBD resource agent itself, not
pacemaker. Between that message and the two separate monitor operations,
it looks like the agent will only run as a master/slave clone.
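For reference, a master/slave setup along the lines of the Clusters from Scratch guide might look like this in the crm shell -- a sketch only, with resource names taken from this thread and timeouts from the advisory minimums shown below, not a tested drop-in config:

```
configure primitive prim_drbd_idcc_devel ocf:linbit:drbd \
    params drbd_resource=idcc-devel \
    op start timeout=240 interval=0 \
    op stop timeout=100 interval=0 \
    op monitor role=Master interval=10 timeout=20 \
    op monitor role=Slave interval=20 timeout=20
configure ms ms_drbd_idcc_devel prim_drbd_idcc_devel \
    meta master-max=1 master-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true
```

The ms wrapper is what satisfies the agent's clone-max check: the primitive alone never gets clone meta-attributes, so the agent refuses to run.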
> And why does it complain that stop is not configured ?
A confusing error message. It's not complaining that the operations are
not configured, it's saying the operations failed because the resource
is not properly configured. What "properly configured" means is up to
the individual resource agent.
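The rc=6 in these log lines is a standard OCF return code, not something specific to this agent. A small lookup of the codes from the OCF resource agent API (this table is for illustration; the names are the spec's, the function is mine):

```shell
# Map an OCF exit code to its symbolic name; rc=6 means the *resource*
# is misconfigured, which is why the "stop not configured" wording is
# so misleading.
rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        *) echo unknown ;;
    esac
}

rc_name 6    # prints OCF_ERR_CONFIGURED
```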
> Isn't that configured with the default of 20 sec. ? That's what crm said, see above. This message is also repeated nearly 7000 times in 9 minutes.
> If the stop op is not configured and the cluster complains about it, why does it not complain about an unconfigured start op ?
> That it complains about the missing monitor is clear.
> The DC says:
> Aug 1 14:19:33 ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed op stop for prim_drbd_idcc_devel on ha-idg-1: not configured (6)
> Aug 1 14:19:33 ha-idg-2 pengine: error: unpack_rsc_op: Preventing prim_drbd_idcc_devel from re-starting anywhere: operation stop failed 'not configured' (6)
> Again it complains about a failed stop, saying it's not configured. Or is it complaining that the failure of a stop op is not configured ?
Again, it's confusing, but you have various logs of the same event
coming from three different places.
First, DRBD logged that there is a "meta parameter misconfigured". It
then reported that error value back to the crmd cluster daemon that
called it, so the crmd logged the error as well, that the result of the
operation was "not configured".
Then (above), when the policy engine reads the current status of the
cluster, it sees that there is a failed operation, so it decides what to
do about the failure.
> The doc says:
> "Some operations are generated by the cluster itself, for example, stopping and starting resources as needed."
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html . Is the doc wrong ?
> What happens when i DON'T configure start/stop operations ? Are they configured automatically ?
> I have several primitives without a configured start/stop operation, but never had any problems with them.
Start and stop are indeed scheduled by the cluster itself. If start and
stop operations are configured in the cluster configuration, they are
used solely to supply meta-attributes such as timeout, overriding the
defaults; they don't change when the cluster runs those actions.
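To make that concrete, a hedged example in crm shell syntax (resource name from this thread): these op entries don't change when or whether start/stop run, they only raise the timeouts above the 20s default:

```
configure primitive prim_drbd_idcc_devel ocf:linbit:drbd \
    params drbd_resource=idcc-devel \
    op start timeout=240 interval=0 \
    op stop timeout=100 interval=0
```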
> failcount is direct INFINITY:
> Aug 1 14:19:33 ha-idg-1 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prim_drbd_idcc_devel (INFINITY)
> Aug 1 14:19:33 ha-idg-1 attrd: notice: attrd_perform_update: Sent update 8: fail-count-prim_drbd_idcc_devel=INFINITY
Yes, a few result codes are considered "fatal", or automatically
INFINITY failures. The idea is that if the resource is misconfigured,
that's not going to change by simply re-running the agent.
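That also means the fail count won't decay away on its own: after fixing the configuration, you have to clear the failure explicitly. A sketch, to be run on a cluster node, using the resource and node names from this thread:

```
# forget the fatal failure so Pacemaker will try the resource again
crm_resource --cleanup --resource prim_drbd_idcc_devel --node ha-idg-1
```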
> After exactly 9 minutes the complaints about the not-configured stop operation stopped; the complaints about the missing clone-max still appear, although both nodes are in standby
I'm not sure why your nodes are in standby, but that should be unrelated
to all of this, unless perhaps you configured on-fail=standby.
> now fail-count is 1 million:
> Aug 1 14:28:33 ha-idg-1 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prim_drbd_idcc_devel (1000000)
> Aug 1 14:28:33 ha-idg-1 attrd: notice: attrd_perform_update: Sent update 7076: fail-count-prim_drbd_idcc_devel=1000000
Within Pacemaker, INFINITY = 1000000. I'm not sure why it's logged
differently here, but it's the same value.
> and a complain about monitor operation appeared again:
> Aug 1 14:28:33 ha-idg-1 crmd: notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_60000: not configured (node=ha-idg-1, call=6968, rc=6, cib-update=6932, confirmed=false)
> Aug 1 14:28:33 ha-idg-1 attrd: notice: attrd_cs_dispatch: Update relayed from ha-idg-2
> crm_mon said:
> Failed actions:
> prim_drbd_idcc_devel_stop_0 on ha-idg-1 'not configured' (6): call=6967, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=41ms
> prim_drbd_idcc_devel_monitor_60000 on ha-idg-1 'not configured' (6): call=6968, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=41ms
> prim_drbd_idcc_devel_stop_0 on ha-idg-2 'not configured' (6): call=6963, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=40ms
> A big problem was that I have a ClusterMon resource running on each node. It triggered about 20000 SNMP traps in 193 seconds to my management station, which triggered 20000 e-mails ...
> Where does this incredible number of traps come from ? Nearly all traps said that stop is not configured for the drbd resource. Why complain so often ? And why stop after ~20000 traps ?
> And it complained about the unconfigured monitor operation just 8 times.
I'm not really sure; I haven't used ClusterMon enough to say. If you
have Pacemaker 1.1.15 or later, the alerts feature is preferred to
ClusterMon.
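With the alerts feature, Pacemaker runs an agent script of your choosing once per event, passing details in CRM_alert_* environment variables. A hedged sketch of such an agent, not a complete one -- the defaults at the top are sample data standing in for a real invocation, where Pacemaker would set these variables itself:

```shell
#!/bin/sh
# Minimal Pacemaker alert agent sketch (alerts feature, >= 1.1.15).
# Sample values below mimic the failed stop from this thread.
: "${CRM_alert_kind:=resource}"
: "${CRM_alert_rsc:=prim_drbd_idcc_devel}"
: "${CRM_alert_task:=stop}"
: "${CRM_alert_node:=ha-idg-1}"
: "${CRM_alert_rc:=6}"

case "${CRM_alert_kind}" in
    resource)
        # a real agent would send a trap or mail here instead of printing
        echo "${CRM_alert_rsc} ${CRM_alert_task} on ${CRM_alert_node}: rc=${CRM_alert_rc}"
        ;;
    node)
        echo "node ${CRM_alert_node}: ${CRM_alert_desc}"
        ;;
esac
```

Unlike ClusterMon, alert agents are invoked per event with structured data, which makes it easier to rate-limit or filter on your side.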
> Btw: is there a history like in bash where I can see which crm command I entered at which time ? I know that crm history is powerful, but I didn't find that there.
Ken Gaillot <kgaillot at redhat.com>