[ClusterLabs] query on pacemaker monitor timeout
Ken Gaillot
kgaillot at redhat.com
Fri Dec 11 13:49:18 EST 2020
On Thu, 2020-12-10 at 17:53 +0000, S Sathish S wrote:
> Hi Team,
>
> Problem Statement:
>
> A pcs resource monitor timed out after 120000ms, and Pacemaker
> recovered the resource (application) by stopping and restarting it on
> the very first occurrence. This restart caused a momentary traffic
> impact in the customer's environment. We suspect the timeout happened
> because the process-check commands run by the monitor function hung
> or were delayed due to system resource unavailability.
>
> We cannot confirm whether that is exactly what happened; the only
> information we have is HAZELCAST_occ12_monitor_10000:59159 - terminated
> with signal 9.
>
> Error messages seen on the customer node:
>
> zrangun@seliius25303[16:40][var/log/pacemaker]$ grep -ia HAZELCAST
> pacemaker.log
> Nov 15 22:42:33 occ12 pacemaker-execd [2796]
> (child_timeout_callback) warning: HAZELCAST_occ12_monitor_10000
> process (PID 57827) timed out
> Nov 15 22:42:33 occ12 pacemaker-execd [2796]
> (operation_finished) warning: HAZELCAST_occ12_monitor_10000:57827 -
> timed out after 120000ms
> Nov 15 22:42:47 occ12 pacemaker-execd [2796]
> (cancel_recurring_action) info: Cancelling ocf operation
> HAZELCAST_occ12_monitor_10000
> Nov 15 22:42:47 occ12 pacemaker-execd [2796]
> (services_action_cancel) info: Terminating in-flight op
> HAZELCAST_occ12_monitor_10000 (pid 59159) early because it was
> cancelled
> Nov 15 22:42:47 occ12 pacemaker-execd [2796]
> (operation_finished) info: HAZELCAST_occ12_monitor_10000:59159 -
> terminated with signal 9
> Nov 15 22:42:47 occ12 pacemaker-execd [2796]
> (cancel_recurring_action) info: Cancelling ocf operation
> HAZELCAST_occ12_monitor_10000
> Nov 15 22:42:47 occ12 pacemaker-execd [2796] (log_execute) info:
> executing - rsc:HAZELCAST_occ12 action:stop call_id:391
> Nov 15 22:43:41 occ12 pacemaker-execd [2796]
> (log_finished) info: finished - rsc:HAZELCAST_occ12
> action:stop call_id:391 pid:59476 exit-code:0 exec-time:53623ms
> queue-time:0ms
> Nov 15 22:43:42 occ12 pacemaker-execd [2796] (log_execute) info:
> executing - rsc:HAZELCAST_occ12 action:start call_id:392
> Nov 15 22:43:46 occ12 pacemaker-execd [2796]
> (operation_finished) notice: HAZELCAST_occ12_start_0:61681:stderr [
> touch: cannot touch '/usr/var/run/resource-agents/hazelcast-
> HAZELCAST_occ12.state': No such file or directory ]
> Nov 15 22:43:46 occ12 pacemaker-execd [2796]
> (log_finished) info: finished - rsc:HAZELCAST_occ12
> action:start call_id:392 pid:61681 exit-code:1 exec-time:3525ms
> queue-time:0ms
> Nov 15 22:43:46 occ12 pacemaker-execd [2796] (log_execute) info:
> executing - rsc:HAZELCAST_occ12 action:stop call_id:393
> Nov 15 22:43:47 occ12 pacemaker-execd [2796]
> (log_finished) info: finished - rsc:HAZELCAST_occ12
> action:stop call_id:393 pid:64134 exit-code:0 exec-time:695ms queue-
> time:0ms
> Nov 15 22:43:50 occ12 pacemaker-execd [2796] (log_execute) info:
> executing - rsc:HAZELCAST_occ12 action:start call_id:394
> Nov 15 22:45:15 occ12 pacemaker-execd [2796]
> (log_finished) info: finished - rsc:HAZELCAST_occ12
> action:start call_id:394 pid:64410 exit-code:0 exec-time:85211ms
> queue-time:1ms
>
> For reference, the resource configuration and the dummy_monitor
> function from a local node are shared below.
>
> Resource setup:
>
> [root@vmc0137 ~]# pcs resource show HAZELCAST_vmc0137
> Resource: HAZELCAST_vmc0137 (class=ocf provider=provider
> type=HazelCast_RA)
> Meta Attrs: failure-timeout=120s migration-threshold=5 priority=50
> Operations: migrate_from interval=0s timeout=20 (HAZELCAST_vmc0137-
> migrate_from-interval-0s)
> migrate_to interval=0s timeout=20 (HAZELCAST_vmc0137-
> migrate_to-interval-0s)
> monitor interval=10s on-fail=restart timeout=120s
> (HAZELCAST_vmc0137-monitor-interval-10s)
> reload interval=0s timeout=20 (HAZELCAST_vmc0137-
> reload-interval-0s)
> start interval=0s on-fail=restart timeout=120s
> (HAZELCAST_vmc0137-start-interval-0s)
> stop interval=0s timeout=120s (HAZELCAST_vmc0137-stop-
> interval-0s)
>
> Monitor function input:
>
> dummy_monitor() {
>     # Monitor _MUST_ differentiate correctly between running (SUCCESS),
>     # failed (ERROR) and _cleanly_ stopped (NOT RUNNING).
>     # That is THREE states, not just yes/no.
>     #sleep ${OCF_RESKEY_op_sleep}
>
>     output=$(su - ogw -c "/opt/occ/$PRODUCT_NUMBER/bin/RCControl status SERVER")
>     number=$(grep -c "Running as PID" <<< "$output")
>     PID=$(pgrep -f "Dcmg.component.name=SERVER")
>
>     if [ "$number" -eq 1 ] || [ -n "$PID" ]; then
>         # SERVER is running; recreate its PID file if it has gone missing.
>         if [ ! -f "/opt/occ/var/pid/SERVER.$(hostname).pid" ]; then
>             NOW=$(date +"%b %d %H:%M:%S")
>             echo "$PID" > "/opt/occ/var/pid/SERVER.$(hostname).pid"
>             chown ogw:med "/opt/occ/var/pid/SERVER.$(hostname).pid"
>             echo "$NOW Monitor found SERVER pid file missing and is recreating it" >> /var/log/cluster/corosync.log
>         fi
>         return $OCF_SUCCESS
>     fi
>
>     NOW=$(date +"%b %d %H:%M:%S")
>     echo "$NOW Monitor found SERVER component not running; restart will follow" >> /var/log/cluster/corosync.log
>     return $OCF_NOT_RUNNING
> }
>
> We need your support and answers on how to avoid the above scenario in
> the future; kindly let us know if any additional logs are required.
>
> 1) Is there any option available to set a fail-retry
> condition for a resource monitor, so that Pacemaker initiates recovery
> only after the monitor fails twice in a row rather than on the first
> failure? Please confirm.
Not currently, but there is a proposal to support that. Basically it's
just waiting on developer time.
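
In the meantime, a rough workaround is to do the counting inside the
agent itself: keep a consecutive-failure counter in a temp file and only
report failure once it reaches a threshold. A minimal sketch, assuming a
threshold of 2 and a state file under the resource-agents temp directory
(both are illustrative choices, not part of any existing agent):

    # Sketch only: report failure to Pacemaker only after
    # FAIL_THRESHOLD consecutive monitor misses. Path and threshold
    # are assumptions for illustration.
    FAIL_THRESHOLD=2
    FAIL_COUNT_FILE="${HA_RSCTMP:-/var/run/resource-agents}/SERVER.monitor.failcount"

    report_monitor_result() {
        rc=$1
        if [ "$rc" -eq "$OCF_SUCCESS" ]; then
            rm -f "$FAIL_COUNT_FILE"    # a healthy check resets the counter
            return "$OCF_SUCCESS"
        fi
        count=$(cat "$FAIL_COUNT_FILE" 2>/dev/null || echo 0)
        count=$((count + 1))
        echo "$count" > "$FAIL_COUNT_FILE"
        if [ "$count" -lt "$FAIL_THRESHOLD" ]; then
            # Mask the first miss so Pacemaker does not recover yet.
            return "$OCF_SUCCESS"
        fi
        rm -f "$FAIL_COUNT_FILE"
        return "$rc"
    }

The obvious trade-off is that a real outage is only acted on one monitor
interval later, and a masked failure never shows up in the cluster's
failure history.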
> 2) Is there any better option available to avoid
> the timeout on the first occurrence itself?
Only increasing the timeout.
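
For example, the monitor timeout on the resource above could be raised
with something like this (a sketch; exact pcs syntax varies a bit
between versions, so check pcs resource update --help first):

    # Raise the monitor timeout from 120s to 300s, keeping the
    # existing interval and on-fail policy.
    pcs resource update HAZELCAST_vmc0137 op monitor interval=10s timeout=300s on-fail=restart

The same form works for adjusting the start and stop operation timeouts
if needed.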
> 3) We are thinking of increasing the resource timeout
> to 300s and adding retry logic with the timeout command inside the
> dummy_monitor function of the RA, so that the pgrep command is killed
> if it does not respond within 30s and is retried on the next loop
> iteration. Would this solution help?
Yes, that should work.
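
Something like the following fragment, slotted into the monitor
function, would give that shape (using the coreutils timeout command;
the 30s per-attempt limit is the figure from the question, while three
attempts is an arbitrary choice here, and the total should stay well
under the 300s operation timeout):

    # Sketch: retry the process check up to 3 times, killing any
    # single attempt that takes longer than 30s, so one hung pgrep
    # cannot eat the whole operation timeout.
    attempts=3
    PID=""
    while [ "$attempts" -gt 0 ]; do
        PID=$(timeout 30 pgrep -f "Dcmg.component.name=SERVER")
        [ -n "$PID" ] && break
        attempts=$((attempts - 1))
        sleep 2
    done

    if [ -n "$PID" ]; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"

The same timeout wrapper can also go around the su/RCControl status
call if that is the command suspected of hanging.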
>
> Thanks & Regards,
> S Sathish S
--
Ken Gaillot <kgaillot at redhat.com>