[ClusterLabs] big trouble with a DRBD resource
Lentes, Bernd
bernd.lentes at helmholtz-muenchen.de
Fri Aug 4 18:20:22 CEST 2017
Hi,
first: is there a tutorial or something else that helps in understanding what pacemaker logs to syslog and /var/log/cluster/corosync.log?
I'm trying hard to find out what's going wrong, but the logs are difficult to understand, also because of the sheer amount of information.
Or should I rather work with "crm history" or hb_report?
What happened:
I tried to configure a simple drbd resource following http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#idm140457860751296
I used this simple snip from the doc:
configure primitive WebData ocf:linbit:drbd params drbd_resource=wwwdata \
op monitor interval=60s
I did it on the live cluster, which is currently in testing. I will never do this again; crm shadow will be my friend.
The cluster reacted promptly:
crm(live)# configure primitive prim_drbd_idcc_devel ocf:linbit:drbd params drbd_resource=idcc-devel \
> op monitor interval=60
WARNING: prim_drbd_idcc_devel: default timeout 20s for start is smaller than the advised 240
WARNING: prim_drbd_idcc_devel: default timeout 20s for stop is smaller than the advised 100
WARNING: prim_drbd_idcc_devel: action monitor not advertised in meta-data, it may not be supported by the RA
From what I understand so far, I didn't configure start/stop operations, so the cluster chose the defaults from default-action-timeout.
It didn't configure the monitor operation because that action is not advertised in the RA's meta-data.
I checked it:
crm(live)# ra info ocf:linbit:drbd
Manages a DRBD device as a Master/Slave resource (ocf:linbit:drbd)
Operations' defaults (advisory minimum):
start timeout=240
promote timeout=90
demote timeout=90
notify timeout=90
stop timeout=100
monitor_Slave timeout=20 interval=20
monitor_Master timeout=20 interval=10
OK. I have to configure monitor_Slave and monitor_Master.
The log says:
Aug 1 14:19:33 ha-idg-1 drbd(prim_drbd_idcc_devel)[11325]: ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset.
^^^^^^^^^
Aug 1 14:19:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_0: not configured (node=ha-idg-1, call=73, rc=6, cib-update=37, confirmed=true)
Aug 1 14:19:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_stop_0: not configured (node=ha-idg-1, call=74, rc=6, cib-update=38, confirmed=true)
Why is it complaining about a missing clone-max? That is a meta attribute for a clone, not for a simple resource!?! This message is repeated constantly; it still appears although the cluster has been in standby for three days.
And why does it complain that stop is not configured?
Isn't that configured with the default of 20 sec.? That's what crm said, see above. This message is also repeated nearly 7000 times within 9 minutes.
If the stop op is not configured and the cluster complains about it, why does it not complain about an unconfigured start op?
That it complains about the missing monitor is clear.
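For reference, the Clusters from Scratch guide runs the DRBD agent inside a master/slave resource, which is where meta attributes like clone-max come from. A minimal sketch, assuming my resource and device names and the advisory timeouts from "ra info" above (untested, cluster commands only):

```shell
# Sketch: the ocf:linbit:drbd agent expects to run as a master/slave
# (clone) resource; the ms wrapper supplies clone-max & friends.
crm configure primitive prim_drbd_idcc_devel ocf:linbit:drbd \
    params drbd_resource=idcc-devel \
    op start timeout=240 op stop timeout=100 \
    op monitor role=Master interval=10 timeout=20 \
    op monitor role=Slave interval=20 timeout=20
crm configure ms ms_drbd_idcc_devel prim_drbd_idcc_devel \
    meta master-max=1 master-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true
```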
The DC says:
Aug 1 14:19:33 ha-idg-2 pengine[27043]: warning: unpack_rsc_op_failure: Processing failed op stop for prim_drbd_idcc_devel on ha-idg-1: not configured (6)
Aug 1 14:19:33 ha-idg-2 pengine[27043]: error: unpack_rsc_op: Preventing prim_drbd_idcc_devel from re-starting anywhere: operation stop failed 'not configured' (6)
Again it complains about a failed stop, saying it's not configured. Or is it complaining that the failure of a stop op is not configured?
The doc says:
"Some operations are generated by the cluster itself, for example, stopping and starting resources as needed."
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html . Is the doc wrong ?
What happens when I DON'T configure start/stop operations? Are they created automatically?
I have several primitives without a configured start/stop operation, but never had any problems with them.
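For those primitives, the fallback the timeout warnings refer to is the cluster-wide operation default; a hedged sketch of raising it (op_defaults is the usual crm shell way in pacemaker 1.1, the 60s value is just an example):

```shell
# Raise the fallback timeout used by any operation
# that has no explicit timeout of its own
crm configure op_defaults timeout=60s
# Verify what is currently configured
crm configure show
```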
The fail-count goes straight to INFINITY:
Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prim_drbd_idcc_devel (INFINITY)
Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: Sent update 8: fail-count-prim_drbd_idcc_devel=INFINITY
After exactly 9 minutes the complaints about the unconfigured stop operation ceased; the complaints about the missing clone-max still appear, although both nodes are in standby.
Now the fail-count is one million:
Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prim_drbd_idcc_devel (1000000)
Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: Sent update 7076: fail-count-prim_drbd_idcc_devel=1000000
and a complaint about the monitor operation appeared again:
Aug 1 14:28:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_60000: not configured (node=ha-idg-1, call=6968, rc=6, cib-update=6932, confirmed=false)
Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_cs_dispatch: Update relayed from ha-idg-2
crm_mon said:
Failed actions:
prim_drbd_idcc_devel_stop_0 on ha-idg-1 'not configured' (6): call=6967, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=41ms
prim_drbd_idcc_devel_monitor_60000 on ha-idg-1 'not configured' (6): call=6968, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=41ms
prim_drbd_idcc_devel_stop_0 on ha-idg-2 'not configured' (6): call=6963, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=40ms
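Once the misconfiguration itself is fixed, the INFINITY fail-count has to be cleared before the resource may run anywhere again. A sketch of how I understand that is done (cluster commands, untested here):

```shell
# Clear failed actions and the fail-count for the resource on all nodes
crm resource cleanup prim_drbd_idcc_devel
# Or per node, with the low-level tool
crm_resource --cleanup --resource prim_drbd_idcc_devel --node ha-idg-1
```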
A big problem was that I have a ClusterMon resource running on each node. It sent about 20,000 SNMP traps within 193 seconds to my management station, which triggered 20,000 e-mails ...
Where does this incredible number of traps come from? Nearly all traps said that stop is not configured for the DRBD resource. Why complain so often? And why stop after ~20,000 traps?
And it complained about the unconfigured monitor operation just 8 times.
Btw: is there a history like in bash where I can see which crm command I entered at which time? I know that crm history is mighty, but I didn't find that there.
Bernd
--
Bernd Lentes
Systemadministration
institute of developmental genetics
Gebäude 35.34 - Raum 208
HelmholtzZentrum München
bernd.lentes at helmholtz-muenchen.de
phone: +49 (0)89 3187 1241
fax: +49 (0)89 3187 2294
no backup - no mercy
Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671