[ClusterLabs] Resource monitors crash, restart, leave core files
Ken Gaillot
kgaillot at redhat.com
Thu Mar 5 12:21:39 EST 2020
On Thu, 2020-03-05 at 13:14 +0000, Jaap Winius wrote:
> Hi folks,
>
> My test system, which includes a filesystem resource called
> 'mount', otherwise works fine, but every day or so I see monitor
> errors like the following when I run 'pcs status':
>
> Failed Resource Actions:
> * mount_monitor_20000 on bd3c7 'unknown error' (1): call=23,
>   status=Error, exitreason='',
>   last-rc-change='Thu Mar 5 04:57:55 2020', queued=0ms, exec=0ms
>
> The corosync.log shows some more information (see log fragments
> below), but I'm unable to identify a cause. The resource monitor
> bombs out, produces a core dump and then starts up again about 2
> seconds later. I've also seen this happen with the monitor for my
> nfsserver resource. Apart from the monitor being down for a few
> seconds, the other problem is that this will eventually fill up
> the filesystem containing the ./pacemaker/cores/ directory with
> core files (so far, each is less than 1MB).
>
> Could this be a bug, or is my software not configured correctly
> (see cfg below)?
>
> Thanks,
>
> Jaap
>
> PS -- I'm using CentOS 7.7.1908, Corosync 2.4.3, Pacemaker 1.1.20,
> PCS 0.9.167 and DRBD 9.10.0.
>
> ################# corosync.log #########
>
> Mar 05 04:57:55 [15652] bd3c7.umrk.nl lrmd: error: child_waitpid: Managed process 22553 (mount_monitor_20000) dumped core
This would have to be a bug in the resource agent. I'd build it with
debug symbols to get a backtrace from the core.
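
For example, here's a minimal sketch of that (the core file name is
made up for illustration, and the crashing binary is an assumption --
since the Filesystem agent is a shell script, run 'file' on the core
first to see which executable actually segfaulted):

    # See which executable produced the core; Pacemaker usually keeps
    # cores under /var/lib/pacemaker/cores (path is an assumption here)
    file /var/lib/pacemaker/cores/core.22553

    # Install debug symbols for that executable (debuginfo-install is
    # part of yum-utils on CentOS 7), then print a full backtrace
    debuginfo-install -y bash
    gdb -batch -ex 'bt full' /bin/bash /var/lib/pacemaker/cores/core.22553

The top frames of the backtrace should show where it crashed, which
is what you'd want to attach to a bug report.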
> Mar 05 04:57:55 [15652] bd3c7.umrk.nl lrmd: warning: operation_finished: mount_monitor_20000:22553 - terminated with signal 11
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: error: process_lrm_event: Result of monitor operation for mount on bd3c7: Error | call=23 key=mount_monitor_20000 confirmed=false status=4 cib-update=143
> ...
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: info: abort_transition_graph: Transition aborted by operation mount_monitor_20000 'create' on bd3c7: Old event | magic=4:1;40:2:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953 cib=0.22.62 source=process_graph_event:499 complete=true
> ...
> Mar 05 04:57:55 [15655] bd3c7.umrk.nl crmd: info: process_graph_event: Detected action (2.40) mount_monitor_20000.23=unknown error: failed
> ...
> Mar 05 04:57:56 [15652] bd3c7.umrk.nl lrmd: info: cancel_recurring_action: Cancelling ocf operation mount_monitor_20000
> ...
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: notice: te_rsc_command: Initiating monitor operation mount_monitor_20000 locally on bd3c7 | action 1
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: do_lrm_rsc_op: Performing key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953 op=mount_monitor_20000
> ...
> Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_20000']: @transition-key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @transition-magic=-1:193;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1583380677, @exec-time=0
> ...
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: process_lrm_event: Result of monitor operation for mount on bd3c7: 0 (ok) | call=51 key=mount_monitor_20000 confirmed=false cib-update=159
> ...
> Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_20000']: @transition-magic=0:0;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @call-id=51, @rc-code=0, @op-status=0, @exec-time=70
> Mar 05 04:57:57 [15650] bd3c7.umrk.nl cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=bd3c7/crmd/159, version=0.22.77)
> Mar 05 04:57:57 [15655] bd3c7.umrk.nl crmd: info: match_graph_event: Action mount_monitor_20000 (1) confirmed on bd3c7 (rc=0)
>
> ########################################
>
> ################# Pacemaker cfg ########
>
> ~# pcs resource defaults resource-stickiness=100 ; \
>    pcs resource create drbd ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s ; \
>    pcs resource master drbd master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true ; \
>    pcs resource create mount Filesystem device="/dev/drbd0" directory="/data" fstype="ext4" ; \
>    pcs constraint colocation add mount with drbd-master INFINITY with-rsc-role=Master ; \
>    pcs constraint order promote drbd-master then mount ; \
>    pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.2.73 cidr_netmask=24 op monitor interval=30s ; \
>    pcs constraint colocation add vip with drbd-master INFINITY with-rsc-role=Master ; \
>    pcs constraint order mount then vip ; \
>    pcs resource create nfsd nfsserver nfs_shared_infodir=/data ; \
>    pcs resource create nfscfg exportfs clientspec="192.168.2.55" options=rw,no_subtree_check,no_root_squash directory=/data fsid=0 ; \
>    pcs constraint colocation add nfsd with vip ; \
>    pcs constraint colocation add nfscfg with nfsd ; \
>    pcs constraint order vip then nfsd ; \
>    pcs constraint order nfsd then nfscfg
>
> ########################################
>
--
Ken Gaillot <kgaillot at redhat.com>