[ClusterLabs] big trouble with a DRBD resource
Lentes, Bernd
bernd.lentes at helmholtz-muenchen.de
Fri Aug 4 18:20:22 CEST 2017
Hi,
first: is there a tutorial or something else that helps in understanding what pacemaker logs to syslog and /var/log/cluster/corosync.log?
I'm trying hard to find out what's going wrong, but the logs are difficult to understand, also because of the sheer amount of information.
Or should I rather work with "crm history" or hb_report?
What happened:
I tried to configure a simple drbd resource following http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Clusters_from_Scratch/index.html#idm140457860751296
I used this simple snip from the doc:
configure primitive WebData ocf:linbit:drbd params drbd_resource=wwwdata \
op monitor interval=60s
I did it on the live cluster, which is currently in testing. I will never do this again; crm shadow will be my friend.
The cluster reacted promptly:
crm(live)# configure primitive prim_drbd_idcc_devel ocf:linbit:drbd params drbd_resource=idcc-devel \
> op monitor interval=60
WARNING: prim_drbd_idcc_devel: default timeout 20s for start is smaller than the advised 240
WARNING: prim_drbd_idcc_devel: default timeout 20s for stop is smaller than the advised 100
WARNING: prim_drbd_idcc_devel: action monitor not advertised in meta-data, it may not be supported by the RA
From what I understand so far, I didn't configure start/stop operations, so the cluster chose the defaults from default-action-timeout.
It didn't configure the monitor operation because that action is not advertised in the RA's meta-data.
I checked it:
crm(live)# ra info ocf:linbit:drbd
Manages a DRBD device as a Master/Slave resource (ocf:linbit:drbd)
Operations' defaults (advisory minimum):
start timeout=240
promote timeout=90
demote timeout=90
notify timeout=90
stop timeout=100
monitor_Slave timeout=20 interval=20
monitor_Master timeout=20 interval=10
OK. I have to configure monitor_Slave and monitor_Master.
The log says:
Aug 1 14:19:33 ha-idg-1 drbd(prim_drbd_idcc_devel)[11325]: ERROR: meta parameter misconfigured, expected clone-max -le 2, but found unset.
^^^^^^^^^
Aug 1 14:19:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_0: not configured (node=ha-idg-1, call=73, rc=6, cib-update=37, confirmed=true)
Aug 1 14:19:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_stop_0: not configured (node=ha-idg-1, call=74, rc=6, cib-update=38, confirmed=true)
Why is it complaining about a missing clone-max? That is a meta attribute for a clone, not for a simple resource!?! This message is repeated constantly; it still appears although the cluster has been in standby for three days.
And why does it complain that stop is not configured?
Isn't that configured with the default of 20 sec.? That's what crm said, see above. This message is also repeated nearly 7000 times within 9 minutes.
If the stop op is not configured and the cluster complains about it, why does it not complain about an unconfigured start op?
That it complains about the missing monitor is clear.
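For reference, the Clusters from Scratch guide runs the DRBD agent inside a master/slave resource, which is where meta attributes like clone-max come from. A minimal sketch, assuming my resource and device names and the advisory timeouts from "ra info" above (untested, cluster commands only):

```shell
# Sketch: the ocf:linbit:drbd agent expects to run as a master/slave
# (clone) resource; the ms wrapper supplies clone-max & friends.
crm configure primitive prim_drbd_idcc_devel ocf:linbit:drbd \
    params drbd_resource=idcc-devel \
    op start timeout=240 op stop timeout=100 \
    op monitor role=Master interval=10 timeout=20 \
    op monitor role=Slave interval=20 timeout=20
crm configure ms ms_drbd_idcc_devel prim_drbd_idcc_devel \
    meta master-max=1 master-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true
```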
The DC says:
Aug 1 14:19:33 ha-idg-2 pengine[27043]: warning: unpack_rsc_op_failure: Processing failed op stop for prim_drbd_idcc_devel on ha-idg-1: not configured (6)
Aug 1 14:19:33 ha-idg-2 pengine[27043]: error: unpack_rsc_op: Preventing prim_drbd_idcc_devel from re-starting anywhere: operation stop failed 'not configured' (6)
Again it complains about a failed stop, saying it's not configured. Or is it complaining that the failure of a stop op is not configured?
The doc says:
"Some operations are generated by the cluster itself, for example, stopping and starting resources as needed."
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_resource_operations.html . Is the doc wrong ?
What happens when I DON'T configure start/stop operations? Are they created automatically?
I have several primitives without a configured start/stop operation, but never had any problems with them.
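For those primitives, the fallback the timeout warnings refer to is the cluster-wide operation default; a hedged sketch of raising it (op_defaults is the usual crm shell way in pacemaker 1.1, the 60s value is just an example):

```shell
# Raise the fallback timeout used by any operation
# that has no explicit timeout of its own
crm configure op_defaults timeout=60s
# Verify what is currently configured
crm configure show
```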
The fail-count goes straight to INFINITY:
Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prim_drbd_idcc_devel (INFINITY)
Aug 1 14:19:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: Sent update 8: fail-count-prim_drbd_idcc_devel=INFINITY
After exactly 9 minutes the complaints about the unconfigured stop operation ceased; the complaints about the missing clone-max still appear, although both nodes are in standby.
Now the fail-count is one million:
Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prim_drbd_idcc_devel (1000000)
Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_perform_update: Sent update 7076: fail-count-prim_drbd_idcc_devel=1000000
and a complaint about the monitor operation appeared again:
Aug 1 14:28:33 ha-idg-1 crmd[4692]: notice: process_lrm_event: Operation prim_drbd_idcc_devel_monitor_60000: not configured (node=ha-idg-1, call=6968, rc=6, cib-update=6932, confirmed=false)
Aug 1 14:28:33 ha-idg-1 attrd[4690]: notice: attrd_cs_dispatch: Update relayed from ha-idg-2
crm_mon said:
Failed actions:
prim_drbd_idcc_devel_stop_0 on ha-idg-1 'not configured' (6): call=6967, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=41ms
prim_drbd_idcc_devel_monitor_60000 on ha-idg-1 'not configured' (6): call=6968, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=41ms
prim_drbd_idcc_devel_stop_0 on ha-idg-2 'not configured' (6): call=6963, status=complete, exit-reason='none', last-rc-change='Tue Aug 1 14:28:33 2017', queued=0ms, exec=40ms
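Once the misconfiguration itself is fixed, the INFINITY fail-count has to be cleared before the resource may run anywhere again. A sketch of how I understand that is done (cluster commands, untested here):

```shell
# Clear failed actions and the fail-count for the resource on all nodes
crm resource cleanup prim_drbd_idcc_devel
# Or per node, with the low-level tool
crm_resource --cleanup --resource prim_drbd_idcc_devel --node ha-idg-1
```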
A big problem was that I have a ClusterMon resource running on each node. It sent about 20,000 SNMP traps within 193 seconds to my management station, which triggered 20,000 e-mails ...
Where does this incredible number of traps come from? Nearly all traps said that stop is not configured for the DRBD resource. Why complain so often? And why stop after ~20,000 traps?
And it complained about the unconfigured monitor operation just 8 times.
Btw: is there a history like in bash where I can see which crm command I entered at which time? I know that crm history is mighty, but I didn't find that there.
Bernd
--
Bernd Lentes
Systemadministration
institute of developmental genetics
Gebäude 35.34 - Raum 208
HelmholtzZentrum München
bernd.lentes at helmholtz-muenchen.de
phone: +49 (0)89 3187 1241
fax: +49 (0)89 3187 2294
no backup - no mercy
Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Heinrich Bassler, Dr. Alfons Enhsen
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671