[ClusterLabs] Resource monitors crash, restart, leave core files

Thu Mar 5 08:14:53 EST 2020

Hi folks,

My test system, which includes support for a filesystem resource  
called 'mount', works fine otherwise, but every day or so I see  
monitor errors like the following when I run 'pcs status':

   Failed Resource Actions:
   * mount_monitor_20000 on bd3c7 'unknown error' (1): call=23, status=Error, exitreason='',
        last-rc-change='Thu Mar  5 04:57:55 2020', queued=0ms, exec=0ms

The corosync.log shows some more information (see log fragments
below), but I'm unable to identify a cause. The resource monitor
crashes with signal 11 (a segmentation fault), produces a core dump,
and is started again about 2 seconds later. I've also seen this happen
with the monitor for my nfsserver resource. Apart from the monitor
being down for a few seconds, the other problem is that the filesystem
holding the ./pacemaker/cores/ directory will eventually fill up with
core files (so far, each is less than 1MB).
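
To keep an eye on how quickly the core files pile up, something like
the following could be run from time to time. It is only a rough
sketch: core_usage is a made-up helper name, the directory has to be
passed in as an argument, and it assumes that the directory holds
nothing but core files.

```shell
# Hypothetical helper: report how many files a directory holds and how
# much space they take. Assumes the directory contains only core files.
core_usage() {
    dir=$1
    # Count regular files below the directory.
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    # Total disk usage of the directory in KiB.
    kib=$(du -sk "$dir" | awk '{print $1}')
    printf '%s files, %s KiB\n' "$count" "$kib"
}

# Usage (pass the actual cores directory):
#   core_usage /path/to/pacemaker/cores
```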

Could this be a bug, or is my software not configured correctly (see  
cfg below)?

Thanks,

Jaap

PS -- I'm using CentOS 7.7.1908, Corosync 2.4.3, Pacemaker 1.1.20, PCS  
0.9.167 and DRBD 9.10.0.

################# corosync.log #########

Mar 05 04:57:55 [15652] bd3c7.umrk.nl       lrmd:    error: child_waitpid:      Managed process 22553 (mount_monitor_20000) dumped core
Mar 05 04:57:55 [15652] bd3c7.umrk.nl       lrmd:  warning: operation_finished: mount_monitor_20000:22553 - terminated with signal 11
Mar 05 04:57:55 [15655] bd3c7.umrk.nl       crmd:    error: process_lrm_event:  Result of monitor operation for mount on bd3c7: Error | call=23 key=mount_monitor_20000 confirmed=false status=4 cib-update=143
...
Mar 05 04:57:55 [15655] bd3c7.umrk.nl       crmd:     info: abort_transition_graph:     Transition aborted by operation mount_monitor_20000 'create' on bd3c7: Old event | magic=4:1;40:2:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953 cib=0.22.62 source=process_graph_event:499 complete=true
...
Mar 05 04:57:55 [15655] bd3c7.umrk.nl       crmd:     info: process_graph_event:        Detected action (2.40) mount_monitor_20000.23=unknown error: failed
...
Mar 05 04:57:56 [15652] bd3c7.umrk.nl       lrmd:     info: cancel_recurring_action:    Cancelling ocf operation mount_monitor_20000
...
Mar 05 04:57:57 [15655] bd3c7.umrk.nl       crmd:   notice: te_rsc_command:     Initiating monitor operation mount_monitor_20000 locally on bd3c7 | action 1
Mar 05 04:57:57 [15655] bd3c7.umrk.nl       crmd:     info: do_lrm_rsc_op:      Performing key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953 op=mount_monitor_20000
...
Mar 05 04:57:57 [15650] bd3c7.umrk.nl        cib:     info: cib_perform_op:     +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_20000']:  @transition-key=1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @transition-magic=-1:193;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @call-id=-1, @rc-code=193, @op-status=-1, @last-rc-change=1583380677, @exec-time=0
...
Mar 05 04:57:57 [15655] bd3c7.umrk.nl       crmd:     info: process_lrm_event:  Result of monitor operation for mount on bd3c7: 0 (ok) | call=51 key=mount_monitor_20000 confirmed=false cib-update=159
...
Mar 05 04:57:57 [15650] bd3c7.umrk.nl        cib:     info: cib_perform_op:     +  /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='mount']/lrm_rsc_op[@id='mount_monitor_20000']:  @transition-magic=0:0;1:71:0:37dad885-d4be-4dcd-8d5f-fd9663e9f953, @call-id=51, @rc-code=0, @op-status=0, @exec-time=70
Mar 05 04:57:57 [15650] bd3c7.umrk.nl        cib:     info: cib_process_request:        Completed cib_modify operation for section status: OK (rc=0, origin=bd3c7/crmd/159, version=0.22.77)
Mar 05 04:57:57 [15655] bd3c7.umrk.nl       crmd:     info: match_graph_event:  Action mount_monitor_20000 (1) confirmed on bd3c7 (rc=0)

########################################
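
To see how often the monitors die, and with which signal, the
operation_finished lines can be tallied per operation. The following
is a minimal sketch, not a tested tool: tally_signals is a made-up
helper name, and it assumes that each entry occupies a single line in
the log file itself and has the "terminated with signal" wording shown
in the fragments above.

```shell
# Hypothetical helper: count "terminated with signal N" events per
# operation in a Pacemaker/corosync log file given as $1.
# Prints lines of the form: <count> <operation> signal <N>
tally_signals() {
    # Keep only the operation name and the signal number of each match.
    sed -n 's/.*operation_finished:[[:space:]]*\([^:]*\):[0-9]* - terminated with signal[[:space:]]*\([0-9]*\).*/\1 signal \2/p' "$1" |
        sort | uniq -c |
        # Strip the leading padding that uniq -c adds.
        awk '{print $1, $2, $3, $4}'
}

# Usage (pass the actual log file):
#   tally_signals /path/to/corosync.log
```

Signal 11 is SIGSEGV, so every line this prints with "signal 11"
corresponds to a segfault in the named operation.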

################# Pacemaker cfg ########

    ~# pcs resource defaults resource-stickiness=100 ; \
       pcs resource create drbd ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s ; \
       pcs resource master drbd master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true ; \
       pcs resource create mount Filesystem device="/dev/drbd0" directory="/data" fstype="ext4" ; \
       pcs constraint colocation add mount with drbd-master INFINITY with-rsc-role=Master ; \
       pcs constraint order promote drbd-master then mount ; \
       pcs resource create vip ocf:heartbeat:IPaddr2 ip=192.168.2.73 cidr_netmask=24 op monitor interval=30s ; \
       pcs constraint colocation add vip with drbd-master INFINITY with-rsc-role=Master ; \
       pcs constraint order mount then vip ; \
       pcs resource create nfsd nfsserver nfs_shared_infodir=/data ; \
       pcs resource create nfscfg exportfs clientspec="192.168.2.55" options=rw,no_subtree_check,no_root_squash directory=/data fsid=0 ; \
       pcs constraint colocation add nfsd with vip ; \
       pcs constraint colocation add nfscfg with nfsd ; \
       pcs constraint order vip then nfsd ; \
       pcs constraint order nfsd then nfscfg

########################################