[ClusterLabs] big trouble with a DRBD resource

Wed Aug 16 14:30:53 UTC 2017

On Wed, 2017-08-16 at 15:20 +0200, Lentes, Bernd wrote:
> 
> > Hi,
> > 
> > What happened:
> > I tried to configure a simple drbd resource following
> > http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#idm140457860751296
> > I used this simple snip from the doc:
> > configure primitive WebData ocf:linbit:drbd params drbd_resource=wwwdata \
> >    op monitor interval=60s
> > 
> > I did it on a live cluster that is currently in testing. I will never do this
> > again. Shadow will be my friend.
> > 
> > The cluster reacted promptly:
> > crm(live)# configure primitive prim_drbd_idcc_devel ocf:linbit:drbd params
> > drbd_resource=idcc-devel \
> >   > op monitor interval=60
> > WARNING: prim_drbd_idcc_devel: default timeout 20s for start is smaller than the
> > advised 240
> > WARNING: prim_drbd_idcc_devel: default timeout 20s for stop is smaller than the
> > advised 100
> > WARNING: prim_drbd_idcc_devel: action monitor not advertised in meta-data, it
> > may not be supported by the RA
> > 
> > As far as I understand so far, I didn't configure start/stop operations, so
> > the cluster uses the default from default-action-timeout.
> > It didn't configure the monitor operation, because this is not in the meta-data.
> > 
> > The log says:
> > Aug  1 14:19:33 ha-idg-1 drbd(prim_drbd_idcc_devel)[11325]: ERROR: meta
> > parameter misconfigured, expected clone-max -le 2, but found unset.
> >                                   ^^^^^^^^^
> > Aug  1 14:19:33 ha-idg-1 crmd[4692]:   notice: process_lrm_event: Operation
> > prim_drbd_idcc_devel_monitor_0: not configured (node=ha-idg-1, call=73, rc=6,
> > cib-update=37, confirmed=true)
> > Aug  1 14:19:33 ha-idg-1 crmd[4692]:   notice: process_lrm_event: Operation
> > prim_drbd_idcc_devel_stop_0: not configured (node=ha-idg-1, call=74, rc=6,
> > cib-update=38, confirmed=true)
> > 
> > crm_mon said:
> > Failed actions:
> >    prim_drbd_idcc_devel_stop_0 on ha-idg-1 'not configured' (6): call=6967,
> >    status=complete, exit-reason='none', last-rc-change='Tue Aug  1 14:28:33 2017',
> >    queued=0ms, exec=41ms
> >    prim_drbd_idcc_devel_monitor_60000 on ha-idg-1 'not configured' (6): call=6968,
> >    status=complete, exit-reason='none', last-rc-change='Tue Aug  1 14:28:33 2017',
> >    queued=0ms, exec=41ms
> >    prim_drbd_idcc_devel_stop_0 on ha-idg-2 'not configured' (6): call=6963,
> >    status=complete, exit-reason='none', last-rc-change='Tue Aug  1 14:28:33 2017',
> >    queued=0ms, exec=40ms
> > 
> > A big problem was that I have a ClusterMon resource running on each node. It
> > triggered about 20000 SNMP traps in 193 seconds to my management station, which
> > triggered 20000 e-mails ...
> > Where does this incredible number of traps come from? Nearly all traps said that
> > stop is not configured for the drbd resource. Why complain so often? And
> > why did it stop after ~20,000 traps?
> > And it complained about the unconfigured monitor operation just 8 times.
> 
> Ok. I configured the drbd resource wrongly/incompletely, and that caused the trouble.
> What I would like to know:
> - Where does crm_mon retrieve its information from?

It uses the C API to be notified of changes to the CIB (which holds all
the cluster state) and of stonith events, and additionally polls the
state every couple of seconds.

> - Why did I get tons of lines in syslog? One message that the resource isn't configured correctly/completely would be enough.
> I got thousands and thousands of lines saying the same thing.

I can't tell from this information alone. Most commonly, a flood like
this happens when a resource agent's start fails and migration-threshold
is left at the default (1,000,000), so pacemaker retries start/stop
repeatedly. However, "not configured" is a fatal error, so pacemaker
wouldn't retry that particular operation. It would log the message every
time a new operation was executed and returned that result, and every
time it did a policy engine run (until the error was cleaned up).
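
To see which operations actually account for the repeated messages, a
rough tally of the crmd lines may help. What follows is only a minimal
Python sketch, not a pacemaker tool. It assumes each syslog entry sits on
a single line (the quoted entries above were wrapped by the mail client)
and that the message format matches the process_lrm_event lines quoted
above. The names EVENT_RE and summarize are made up for illustration.

```python
import re
from collections import Counter

# Assumed format, taken from the crmd lines quoted in this thread:
#   ... process_lrm_event: Operation <op>: <result> (node=<node>, call=N, rc=N, ...
EVENT_RE = re.compile(
    r"process_lrm_event: Operation\s+(?P<op>\S+): (?P<result>[^(]+?) "
    r"\(node=(?P<node>[^,]+), call=\d+, rc=(?P<rc>\d+)"
)


def summarize(lines):
    """Count process_lrm_event results per (node, operation, rc)."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            key = (match.group("node"), match.group("op"), int(match.group("rc")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python tally.py < syslog-excerpt.txt
    for (node, op, rc), n in summarize(sys.stdin).most_common():
        print(f"{n:6d}  {node}  {op}  rc={rc}")
```

Sorting by count should show whether one operation (for example the stop)
dominates, and whether the lines come from new operations or from the
same result being logged again.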

> 
> Bernd