[ClusterLabs] query on pacemaker monitor timeout

S Sathish S s.s.sathish at ericsson.com
Thu Dec 10 12:53:58 EST 2020


Hi Team,
Problem Statement:

A Pacemaker resource monitor operation timed out after 120000ms, and the cluster recovered the resource (application) by stopping and starting it on the very first occurrence. This restart caused a momentary traffic impact in the customer's environment. We suspect the cause of the timeout is that, while the monitor function was executing, the process-check commands hung and were delayed due to system resource unavailability.

We are not able to confirm whether this is exactly what happened; the only information we have is the log line "HAZELCAST_occ12_monitor_10000:59159 - terminated with signal 9".

Error messages seen on the customer node:

zrangun@seliius25303[16:40][var/log/pacemaker]$ grep -ia HAZELCAST pacemaker.log
Nov 15 22:42:33 occ12 pacemaker-execd     [2796] (child_timeout_callback)       warning: HAZELCAST_occ12_monitor_10000 process (PID 57827) timed out
Nov 15 22:42:33 occ12 pacemaker-execd     [2796] (operation_finished)   warning: HAZELCAST_occ12_monitor_10000:57827 - timed out after 120000ms
Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (cancel_recurring_action)      info: Cancelling ocf operation HAZELCAST_occ12_monitor_10000
Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (services_action_cancel)       info: Terminating in-flight op HAZELCAST_occ12_monitor_10000 (pid 59159) early because it was cancelled
Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (operation_finished)   info: HAZELCAST_occ12_monitor_10000:59159 - terminated with signal 9
Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (cancel_recurring_action)      info: Cancelling ocf operation HAZELCAST_occ12_monitor_10000
Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (log_execute)  info:  executing - rsc:HAZELCAST_occ12 action:stop call_id:391
Nov 15 22:43:41 occ12 pacemaker-execd     [2796] (log_finished)         info: finished - rsc:HAZELCAST_occ12 action:stop call_id:391 pid:59476 exit-code:0 exec-time:53623ms queue-time:0ms
Nov 15 22:43:42 occ12 pacemaker-execd     [2796] (log_execute)  info: executing - rsc:HAZELCAST_occ12 action:start call_id:392
Nov 15 22:43:46 occ12 pacemaker-execd     [2796] (operation_finished)   notice: HAZELCAST_occ12_start_0:61681:stderr [ touch: cannot touch '/usr/var/run/resource-agents/hazelcast-HAZELCAST_occ12.state': No such file or directory ]
Nov 15 22:43:46 occ12 pacemaker-execd     [2796] (log_finished)         info: finished - rsc:HAZELCAST_occ12 action:start call_id:392 pid:61681 exit-code:1 exec-time:3525ms queue-time:0ms
Nov 15 22:43:46 occ12 pacemaker-execd     [2796] (log_execute)  info: executing - rsc:HAZELCAST_occ12 action:stop call_id:393
Nov 15 22:43:47 occ12 pacemaker-execd     [2796] (log_finished)         info: finished - rsc:HAZELCAST_occ12 action:stop call_id:393 pid:64134 exit-code:0 exec-time:695ms queue-time:0ms
Nov 15 22:43:50 occ12 pacemaker-execd     [2796] (log_execute)  info: executing - rsc:HAZELCAST_occ12 action:start call_id:394
Nov 15 22:45:15 occ12 pacemaker-execd     [2796] (log_finished)         info: finished - rsc:HAZELCAST_occ12 action:start call_id:394 pid:64410 exit-code:0 exec-time:85211ms queue-time:1ms

For reference, we have shared below the resource configuration and the dummy_monitor function from a local node.

Resource setup:

[root@vmc0137 ~]# pcs resource show HAZELCAST_vmc0137
Resource: HAZELCAST_vmc0137 (class=ocf provider=provider type=HazelCast_RA)
  Meta Attrs: failure-timeout=120s migration-threshold=5 priority=50
  Operations: migrate_from interval=0s timeout=20 (HAZELCAST_vmc0137-migrate_from-interval-0s)
              migrate_to interval=0s timeout=20 (HAZELCAST_vmc0137-migrate_to-interval-0s)
              monitor interval=10s on-fail=restart timeout=120s (HAZELCAST_vmc0137-monitor-interval-10s)
              reload interval=0s timeout=20 (HAZELCAST_vmc0137-reload-interval-0s)
              start interval=0s on-fail=restart timeout=120s (HAZELCAST_vmc0137-start-interval-0s)
              stop interval=0s timeout=120s (HAZELCAST_vmc0137-stop-interval-0s)

Monitor function input:

dummy_monitor() {
        # Monitor _MUST!_ differentiate correctly between running
        # (SUCCESS), failed (ERROR) or _cleanly_ stopped (NOT RUNNING).
        # That is THREE states, not just yes/no
        #sleep ${OCF_RESKEY_op_sleep}

        output=$(su - ogw -c "/opt/occ/$PRODUCT_NUMBER/bin/RCControl status SERVER")
        number=$(grep -c "Running as PID" <<< "$output")
        PID=$(pgrep -f "Dcmg.component.name=SERVER")

        if [ "$number" -eq 1 ] || [ -n "$PID" ]; then
            if [ ! -f /opt/occ/var/pid/SERVER.$(hostname).pid ]; then
                NOW=$(date +"%b %d %H:%M:%S")
                echo "$PID" > /opt/occ/var/pid/SERVER.$(hostname).pid
                chown ogw:med /opt/occ/var/pid/SERVER.$(hostname).pid
                echo "$NOW Monitor found SERVER pid file not exist and going to create it" >>/var/log/cluster/corosync.log
            fi
            return $OCF_SUCCESS
        fi
        NOW=$(date +"%b %d %H:%M:%S")
        echo "$NOW Monitor found SERVER component is not running and going for the restart" >>/var/log/cluster/corosync.log
        return $OCF_NOT_RUNNING
}

We need your support and answers to the questions below to avoid the above scenario in future; kindly let us know if any additional logs are required.

                1) Is there any option available to set a fail-retry condition for the resource monitor, so that Pacemaker initiates recovery only if the monitor fails twice in a row, rather than on the first failure? Please confirm.
                2) Is there any other, better option available to avoid these timeout issues on the first occurrence itself?
                3) We are considering increasing the resource monitor timeout value to 300s and adding retry logic to the dummy_monitor function in the RA, using the timeout command so that the pgrep command is killed if it does not respond within 30s and the check is retried on the next loop iteration (a rough sketch is shown after these questions). Will this solution help in our case?
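
For question 3, a rough sketch of the retry logic we have in mind is shown below. It assumes the coreutils timeout command is available on the node; the helper name check_server_pid, the 30s per-attempt limit and the MAX_TRIES value are only illustrative and would need tuning (the RCControl status check would be wrapped in the same way in the real RA).

check_server_pid() {
        # Run pgrep under timeout so a hung process-table scan cannot
        # block the whole monitor; timeout kills pgrep after 30s.
        timeout 30s pgrep -f "Dcmg.component.name=SERVER"
}

dummy_monitor() {
        local tries=0 MAX_TRIES=3 PID=""
        while [ "$tries" -lt "$MAX_TRIES" ]; do
                PID=$(check_server_pid)
                if [ -n "$PID" ]; then
                        return $OCF_SUCCESS
                fi
                # pgrep found nothing or was killed by timeout; wait briefly
                # and retry before declaring the resource down.
                tries=$((tries + 1))
                sleep 5
        done
        return $OCF_NOT_RUNNING
}

With the monitor operation timeout raised to 300s (for example with something like "pcs resource update HAZELCAST_vmc0137 op monitor interval=10s timeout=300s on-fail=restart"), three 30-second attempts plus the short sleeps would finish well within the operation timeout, so a single hung pgrep alone should no longer get the monitor killed by pacemaker-execd.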

Thanks & Regards,
S Sathish S