[ClusterLabs] query on pacemaker monitor timeout

Klaus Wenninger kwenning at redhat.com
Mon Dec 14 07:54:22 EST 2020


On 12/11/20 7:49 PM, Ken Gaillot wrote:
> On Thu, 2020-12-10 at 17:53 +0000, S Sathish S wrote:
>> Hi Team,
>>
>> Problem Statement:
>>  
>> The pcs resource monitor timed out after 120000ms, and pacemaker tried
>> to recover the resource (application) by stopping and starting it on
>> the very first occurrence. This restart caused a momentary traffic
>> impact in the customer environment. We suspect the reason for the
>> timeout is that, while the monitor function was executing, the
>> process-check command hung or was delayed due to system resource
>> unavailability.
>>
>> We are not able to confirm whether that is exactly what happened; the
>> only information we have is "HAZELCAST_occ12_monitor_10000:59159 -
>> terminated with signal 9".
>>  
>> Error messages seen on the customer node:
>>  
>> zrangun@seliius25303[16:40][var/log/pacemaker]$ grep -ia HAZELCAST
>> pacemaker.log
>> Nov 15 22:42:33 occ12 pacemaker-execd     [2796]
>> (child_timeout_callback)       warning: HAZELCAST_occ12_monitor_10000
>> process (PID 57827) timed out
>> Nov 15 22:42:33 occ12 pacemaker-execd     [2796]
>> (operation_finished)   warning: HAZELCAST_occ12_monitor_10000:57827 -
>> timed out after 120000ms
>> Nov 15 22:42:47 occ12 pacemaker-execd     [2796]
>> (cancel_recurring_action)      info: Cancelling ocf operation
>> HAZELCAST_occ12_monitor_10000
>> Nov 15 22:42:47 occ12 pacemaker-execd     [2796]
>> (services_action_cancel)       info: Terminating in-flight op
>> HAZELCAST_occ12_monitor_10000 (pid 59159) early because it was
>> cancelled
>> Nov 15 22:42:47 occ12 pacemaker-execd     [2796]
>> (operation_finished)   info: HAZELCAST_occ12_monitor_10000:59159 -
>> terminated with signal 9
>> Nov 15 22:42:47 occ12 pacemaker-execd     [2796]
>> (cancel_recurring_action)      info: Cancelling ocf operation
>> HAZELCAST_occ12_monitor_10000
>> Nov 15 22:42:47 occ12 pacemaker-execd     [2796] (log_execute)  info:
>>  executing - rsc:HAZELCAST_occ12 action:stop call_id:391
>> Nov 15 22:43:41 occ12 pacemaker-execd     [2796]
>> (log_finished)         info: finished - rsc:HAZELCAST_occ12
>> action:stop call_id:391 pid:59476 exit-code:0 exec-time:53623ms
>> queue-time:0ms
>> Nov 15 22:43:42 occ12 pacemaker-execd     [2796] (log_execute)  info:
>> executing - rsc:HAZELCAST_occ12 action:start call_id:392
>> Nov 15 22:43:46 occ12 pacemaker-execd     [2796]
>> (operation_finished)   notice: HAZELCAST_occ12_start_0:61681:stderr [
>> touch: cannot touch '/usr/var/run/resource-agents/hazelcast-
>> HAZELCAST_occ12.state': No such file or directory ]
>> Nov 15 22:43:46 occ12 pacemaker-execd     [2796]
>> (log_finished)         info: finished - rsc:HAZELCAST_occ12
>> action:start call_id:392 pid:61681 exit-code:1 exec-time:3525ms
>> queue-time:0ms
>> Nov 15 22:43:46 occ12 pacemaker-execd     [2796] (log_execute)  info:
>> executing - rsc:HAZELCAST_occ12 action:stop call_id:393
>> Nov 15 22:43:47 occ12 pacemaker-execd     [2796]
>> (log_finished)         info: finished - rsc:HAZELCAST_occ12
>> action:stop call_id:393 pid:64134 exit-code:0 exec-time:695ms queue-
>> time:0ms
>> Nov 15 22:43:50 occ12 pacemaker-execd     [2796] (log_execute)  info:
>> executing - rsc:HAZELCAST_occ12 action:start call_id:394
>> Nov 15 22:45:15 occ12 pacemaker-execd     [2796]
>> (log_finished)         info: finished - rsc:HAZELCAST_occ12
>> action:start call_id:394 pid:64410 exit-code:0 exec-time:85211ms
>> queue-time:1ms
>>  
>> We have shared the resource configuration and the dummy_monitor
>> function from a local node below for reference.
>>   
>> Resource setup:
>>  
>> [root@vmc0137 ~]# pcs resource show HAZELCAST_vmc0137
>> Resource: HAZELCAST_vmc0137 (class=ocf provider=provider
>> type=HazelCast_RA)
>>   Meta Attrs: failure-timeout=120s migration-threshold=5 priority=50
>>   Operations: migrate_from interval=0s timeout=20 (HAZELCAST_vmc0137-
>> migrate_from-interval-0s)
>>               migrate_to interval=0s timeout=20 (HAZELCAST_vmc0137-
>> migrate_to-interval-0s)
>>               monitor interval=10s on-fail=restart timeout=120s
>> (HAZELCAST_vmc0137-monitor-interval-10s)
>>               reload interval=0s timeout=20 (HAZELCAST_vmc0137-
>> reload-interval-0s)
>>               start interval=0s on-fail=restart timeout=120s
>> (HAZELCAST_vmc0137-start-interval-0s)
>>               stop interval=0s timeout=120s (HAZELCAST_vmc0137-stop-
>> interval-0s)
>>
>> Monitor function input:
>>
>> dummy_monitor() {
>>         # Monitor _MUST!_ differentiate correctly between running
>>         # (SUCCESS), failed (ERROR) or _cleanly_ stopped (NOT RUNNING).
>>         # That is THREE states, not just yes/no.
>>         #sleep ${OCF_RESKEY_op_sleep}
>>
>>         # Ask the product's control script and pgrep whether SERVER is up
>>         output=$(su - ogw -c "/opt/occ/$PRODUCT_NUMBER/bin/RCControl status SERVER")
>>         number=$(grep -c "Running as PID" <<< "$output")
>>         PID=$(pgrep -f "Dcmg.component.name=SERVER")
>>
>>         if [ "$number" -eq 1 ] || [ -n "$PID" ]; then
>>             # Recreate the pid file if it went missing while the process is running
>>             if [ ! -f "/opt/occ/var/pid/SERVER.$(hostname).pid" ]; then
>>                 NOW=$(date +"%b %d %H:%M:%S")
>>                 echo "$PID" > "/opt/occ/var/pid/SERVER.$(hostname).pid"
>>                 chown ogw:med "/opt/occ/var/pid/SERVER.$(hostname).pid"
>>                 echo "$NOW Monitor found SERVER pid file does not exist and is going to create it" >>/var/log/cluster/corosync.log
>>             fi
>>             return $OCF_SUCCESS
>>         fi
>>         NOW=$(date +"%b %d %H:%M:%S")
>>         echo "$NOW Monitor found SERVER component is not running and is going for the restart" >>/var/log/cluster/corosync.log
>>         return $OCF_NOT_RUNNING
>> }
>>  
>> We need your support and an answer on how to avoid the above scenario
>> in the future. Kindly let us know if any additional logs are required.
>>
>>                 1) Is there any option available to set a fail-retry
>> condition for the resource monitor, so that pacemaker initiates
>> recovery only after two consecutive monitor failures rather than on
>> the first one? Please confirm.
> Not currently, but there is a proposal to support that. Basically it's
> just a matter of developer time being unavailable.
>
>>                 2) Is there any better option available to avoid the
>> timeout issue on the first occurrence itself?
> Only increasing the timeout.
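For reference, one way to raise the recurring monitor's timeout with pcs, as a
sketch using the resource name from the configuration above (exact pcs syntax
and behaviour can vary a little between versions):

    # bump the 10s-interval monitor op from timeout=120s to e.g. timeout=300s
    pcs resource update HAZELCAST_vmc0137 op monitor interval=10s on-fail=restart timeout=300s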
>
>>                 3) We thought of increasing the resource timeout value
>> to 300s and adding retry logic to the dummy_monitor function in the RA
>> itself, using the timeout command. In that case the pgrep command
>> would be killed if it does not respond within 30s and retried in the
>> next loop iteration. Will this solution help us?
> Yes, that should work.
IIRC that was one of the reasons to think about failure-retries
triggered by pacemaker: to avoid this kind of kill & retry
logic in RAs, and all the hard-to-debug interference between
multiple layers of timeouts that comes with it.
But of course, as there is no other way ...
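As a rough illustration of the wrapped-probe idea from point 3, here is a
minimal sketch assuming GNU coreutils timeout is available; the 30s per-attempt
limit comes from the thread, while the retry count, the sleep, and the helper
name check_server_process are purely illustrative:

    check_server_process() {
        # Try the process check a few times; each attempt is killed after 30s,
        # so one hung pgrep cannot consume the whole (e.g. 300s) monitor timeout.
        local attempt
        for attempt in 1 2 3; do
            PID=$(timeout 30 pgrep -f "Dcmg.component.name=SERVER") && return 0
            sleep 5
        done
        return 1
    }

    # inside dummy_monitor():
    #     if check_server_process; then return $OCF_SUCCESS; fi
    #     return $OCF_NOT_RUNNING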

Klaus
>
>> Thanks & Regards,
>> S Sathish S


