[ClusterLabs] Resource monitors crash, restart, leave core files
Ken Gaillot
kgaillot at redhat.com
Thu Mar 5 12:21:39 EST 2020
On Thu, 2020-03-05 at 13:14 +0000, Jaap Winius wrote:
> Hi folks,
>
> My test system, which includes a filesystem resource called
> 'mount', otherwise works fine, but every day or so I see monitor
> errors like the following when I run 'pcs status':
>
> Failed Resource Actions:
> * mount_monitor_20000 on bd3c7 'unknown error' (1): call=23,
>   status=Error, exitreason='',
>   last-rc-change='Thu Mar 5 04:57:55 2020', queued=0ms, exec=0ms
>
> The corosync.log shows some more information (see log fragments
> below), but I'm unable to identify a cause. The resource monitor
> bombs out, produces a core dump and then starts up again about 2
> seconds later. I've also seen this happen with the monitor for my
> nfsserver resource. Apart from the monitor being down for a few
> seconds, the other problem is that this will eventually fill up
> the filesystem containing the ./pacemaker/cores/ directory with
> core files (so far, each is less than 1MB).
>
> Could this be a bug, or is my software not configured correctly
> (see cfg below)?
>
> Thanks,
>
> Jaap
>
> PS -- I'm using CentOS 7.7.1908, Corosync 2.4.3, Pacemaker 1.1.20,
> PCS 0.9.167 and DRBD 9.10.0.
>
> ################# corosync.log #########
>
> Mar 05 04:57:55 [15652] bd3c7.umrk.nl lrmd: error: child_waitpid: Managed process 22553 (mount_monitor_20000) dumped core
This would have to be a bug in the resource agent. I'd build it with
debug symbols to get a backtrace from the core.
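
For example, here's a minimal sketch of that (the core file name is
made up for illustration, and the crashing binary is an assumption --
since the Filesystem agent is a shell script, run 'file' on the core
first to see which executable actually segfaulted):

    # See which executable produced the core; Pacemaker usually keeps
    # cores under /var/lib/pacemaker/cores (path is an assumption here)
    file /var/lib/pacemaker/cores/core.22553

    # Install debug symbols for that executable (debuginfo-install is
    # part of yum-utils on CentOS 7), then print a full backtrace
    debuginfo-install -y bash
    gdb -batch -ex 'bt full' /bin/bash /var/lib/pacemaker/cores/core.22553

The top frames of the backtrace should show where it crashed, which
is what you'd want to attach to a bug report.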
> Mar 05 04:57:55 [15652] bd3c7.umrk.nl lrmd: warning: operation_finished: mount_monitor_20000:22553 - terminated with signal 11
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: error: process_lrm_event: Result of monitor operation for mount on bd3c7: Error | call=23 key=mount_monitor_20000 confirmed=false status=4 cib-update=143
> ...
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: info: abort_transition_graph: Transition aborted by operation mount_monitor_20000 'create' on bd3c7: Old event | magic=4:1;40:2:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953 cib=0.22.62 source=process_graph_event:499 complete=true
> ...
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: info: process_graph_event: Detected action (2.40) mount_monitor_20000.23=unknown error: failed
> ...
> Mar 05 04:57:56 [15652] bd3c7.umrk.nl lrmd: info: cancel_recurring_action: Cancelling ocf operation mount_monitor_20000
> ...
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: notice: te_rsc_command: Initiating monitor operation mount_monitor_20000 locally on bd3c7 | action 1
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: do_lrm_rsc_op: Performing key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953 op=mount_monitor_20000
> ...
> Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_20000']: @transition-key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @transition-magic=-1:193;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1583380677, @exec-time=0
> ...
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: process_lrm_event: Result of monitor operation for mount on bd3c7: 0 (ok) | call=51 key=mount_monitor_20000 confirmed=false cib-update=159
> ...
> Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_20000']: @transition-magic=0:0;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @call-id=51, @rc-code=0, @op-status=0, @exec-time=70
> Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=bd3c7/crmd/159, version=0.22.77)
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: match_graph_event: Action mount_monitor_20000 (1) confirmed on bd3c7 (rc=0)
>
> ########################################
>
> ################# Pacemaker cfg ########
>
> ~# pcs resource defaults resource-stickiness=100 ; \
>    pcs resource create drbd ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s ; \
>    pcs resource master drbd master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true ; \
>    pcs resource create mount Filesystem device="/dev/drbd0" directory="/data" fstype="ext4" ; \
>    pcs constraint colocation add mount with drbd-master INFINITY with-rsc-role=Master ; \
>    pcs constraint order promote drbd-master then mount ; \
>    pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.2.73 cidr_netmask=24 op monitor interval=30s ; \
>    pcs constraint colocation add vip with drbd-master INFINITY with-rsc-role=Master ; \
>    pcs constraint order mount then vip ; \
>    pcs resource create nfsd nfsserver nfs_shared_infodir=/data ; \
>    pcs resource create nfscfg exportfs clientspec="192.168.2.55" options=rw,no_subtree_check,no_root_squash directory=/data fsid=0 ; \
>    pcs constraint colocation add nfsd with vip ; \
>    pcs constraint colocation add nfscfg with nfsd ; \
>    pcs constraint order vip then nfsd ; \
>    pcs constraint order nfsd then nfscfg
>
> ########################################
>
--
Ken Gaillot <kgaillot at redhat.com>