[ClusterLabs] query on pacemaker monitor timeout
Ken Gaillot
kgaillot at redhat.com
Fri Dec 11 13:49:18 EST 2020
On Thu, 2020-12-10 at 17:53 +0000, S Sathish S wrote:
> Hi Team,
>
> Problem Statement:
>
> A pcs resource monitor timed out after 120000ms, and Pacemaker
> recovered the resource (application) by stopping and restarting it on
> the very first occurrence. This restart caused a momentary traffic
> impact in the customer's environment. We suspect the timeout happened
> because the process-check commands run by the monitor function hung
> or were delayed due to system resource unavailability.
>
> We cannot confirm whether that is exactly what happened; the only
> information we have is HAZELCAST_occ12_monitor_10000:59159 - terminated
> with signal 9.
>
> Error messages seen on the customer node:
>
> zrangun@seliius25303[16:40][var/log/pacemaker]$ grep -ia HAZELCAST
> pacemaker.log
> Nov 15 22:42:33 occ12 pacemaker-execd [2796]
> (child_timeout_callback) warning: HAZELCAST_occ12_monitor_10000
> process (PID 57827) timed out
> Nov 15 22:42:33 occ12 pacemaker-execd [2796]
> (operation_finished) warning: HAZELCAST_occ12_monitor_10000:57827 -
> timed out after 120000ms
> Nov 15 22:42:47 occ12 pacemaker-execd [2796]
> (cancel_recurring_action) info: Cancelling ocf operation
> HAZELCAST_occ12_monitor_10000
> Nov 15 22:42:47 occ12 pacemaker-execd [2796]
> (services_action_cancel) info: Terminating in-flight op
> HAZELCAST_occ12_monitor_10000 (pid 59159) early because it was
> cancelled
> Nov 15 22:42:47 occ12 pacemaker-execd [2796]
> (operation_finished) info: HAZELCAST_occ12_monitor_10000:59159 -
> terminated with signal 9
> Nov 15 22:42:47 occ12 pacemaker-execd [2796]
> (cancel_recurring_action) info: Cancelling ocf operation
> HAZELCAST_occ12_monitor_10000
> Nov 15 22:42:47 occ12 pacemaker-execd [2796] (log_execute) info:
> executing - rsc:HAZELCAST_occ12 action:stop call_id:391
> Nov 15 22:43:41 occ12 pacemaker-execd [2796]
> (log_finished) info: finished - rsc:HAZELCAST_occ12
> action:stop call_id:391 pid:59476 exit-code:0 exec-time:53623ms
> queue-time:0ms
> Nov 15 22:43:42 occ12 pacemaker-execd [2796] (log_execute) info:
> executing - rsc:HAZELCAST_occ12 action:start call_id:392
> Nov 15 22:43:46 occ12 pacemaker-execd [2796]
> (operation_finished) notice: HAZELCAST_occ12_start_0:61681:stderr [
> touch: cannot touch '/usr/var/run/resource-agents/hazelcast-
> HAZELCAST_occ12.state': No such file or directory ]
> Nov 15 22:43:46 occ12 pacemaker-execd [2796]
> (log_finished) info: finished - rsc:HAZELCAST_occ12
> action:start call_id:392 pid:61681 exit-code:1 exec-time:3525ms
> queue-time:0ms
> Nov 15 22:43:46 occ12 pacemaker-execd [2796] (log_execute) info:
> executing - rsc:HAZELCAST_occ12 action:stop call_id:393
> Nov 15 22:43:47 occ12 pacemaker-execd [2796]
> (log_finished) info: finished - rsc:HAZELCAST_occ12
> action:stop call_id:393 pid:64134 exit-code:0 exec-time:695ms queue-
> time:0ms
> Nov 15 22:43:50 occ12 pacemaker-execd [2796] (log_execute) info:
> executing - rsc:HAZELCAST_occ12 action:start call_id:394
> Nov 15 22:45:15 occ12 pacemaker-execd [2796]
> (log_finished) info: finished - rsc:HAZELCAST_occ12
> action:start call_id:394 pid:64410 exit-code:0 exec-time:85211ms
> queue-time:1ms
>
> For reference, the resource configuration and the dummy_monitor
> function from a local node are shared below.
>
> Resource setup:
>
> [root@vmc0137 ~]# pcs resource show HAZELCAST_vmc0137
> Resource: HAZELCAST_vmc0137 (class=ocf provider=provider
> type=HazelCast_RA)
> Meta Attrs: failure-timeout=120s migration-threshold=5 priority=50
> Operations: migrate_from interval=0s timeout=20 (HAZELCAST_vmc0137-
> migrate_from-interval-0s)
> migrate_to interval=0s timeout=20 (HAZELCAST_vmc0137-
> migrate_to-interval-0s)
> monitor interval=10s on-fail=restart timeout=120s
> (HAZELCAST_vmc0137-monitor-interval-10s)
> reload interval=0s timeout=20 (HAZELCAST_vmc0137-
> reload-interval-0s)
> start interval=0s on-fail=restart timeout=120s
> (HAZELCAST_vmc0137-start-interval-0s)
> stop interval=0s timeout=120s (HAZELCAST_vmc0137-stop-
> interval-0s)
>
> Monitor function input:
>
> dummy_monitor() {
>     # Monitor _MUST_ differentiate correctly between running (SUCCESS),
>     # failed (ERROR) and _cleanly_ stopped (NOT RUNNING).
>     # That is THREE states, not just yes/no.
>     #sleep ${OCF_RESKEY_op_sleep}
>
>     output=$(su - ogw -c "/opt/occ/$PRODUCT_NUMBER/bin/RCControl status SERVER")
>     number=$(grep -c "Running as PID" <<< "$output")
>     PID=$(pgrep -f "Dcmg.component.name=SERVER")
>
>     if [ "$number" -eq 1 ] || [ -n "$PID" ]; then
>         # SERVER is running; recreate its PID file if it has gone missing.
>         if [ ! -f "/opt/occ/var/pid/SERVER.$(hostname).pid" ]; then
>             NOW=$(date +"%b %d %H:%M:%S")
>             echo "$PID" > "/opt/occ/var/pid/SERVER.$(hostname).pid"
>             chown ogw:med "/opt/occ/var/pid/SERVER.$(hostname).pid"
>             echo "$NOW Monitor found SERVER pid file missing and is recreating it" >> /var/log/cluster/corosync.log
>         fi
>         return $OCF_SUCCESS
>     fi
>
>     NOW=$(date +"%b %d %H:%M:%S")
>     echo "$NOW Monitor found SERVER component not running; restart will follow" >> /var/log/cluster/corosync.log
>     return $OCF_NOT_RUNNING
> }
>
> We need your support and answers on how to avoid the above scenario in
> the future; kindly let us know if any additional logs are required.
>
> 1) Is there any option available to set a fail-retry
> condition for a resource monitor, so that Pacemaker initiates recovery
> only after the monitor fails twice in a row rather than on the first
> failure? Please confirm.
Not currently, but there is a proposal to support that. Basically it's
just waiting on developer time.
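
In the meantime, a rough workaround is to do the counting inside the
agent itself: keep a consecutive-failure counter in a temp file and only
report failure once it reaches a threshold. A minimal sketch, assuming a
threshold of 2 and a state file under the resource-agents temp directory
(both are illustrative choices, not part of any existing agent):

    # Sketch only: report failure to Pacemaker only after
    # FAIL_THRESHOLD consecutive monitor misses. Path and threshold
    # are assumptions for illustration.
    FAIL_THRESHOLD=2
    FAIL_COUNT_FILE="${HA_RSCTMP:-/var/run/resource-agents}/SERVER.monitor.failcount"

    report_monitor_result() {
        rc=$1
        if [ "$rc" -eq "$OCF_SUCCESS" ]; then
            rm -f "$FAIL_COUNT_FILE"    # a healthy check resets the counter
            return "$OCF_SUCCESS"
        fi
        count=$(cat "$FAIL_COUNT_FILE" 2>/dev/null || echo 0)
        count=$((count + 1))
        echo "$count" > "$FAIL_COUNT_FILE"
        if [ "$count" -lt "$FAIL_THRESHOLD" ]; then
            # Mask the first miss so Pacemaker does not recover yet.
            return "$OCF_SUCCESS"
        fi
        rm -f "$FAIL_COUNT_FILE"
        return "$rc"
    }

The obvious trade-off is that a real outage is only acted on one monitor
interval later, and a masked failure never shows up in the cluster's
failure history.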
> 2) Is there any better option available to avoid
> the timeout on the first occurrence itself?
Only increasing the timeout.
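
For example, the monitor timeout on the resource above could be raised
with something like this (a sketch; exact pcs syntax varies a bit
between versions, so check pcs resource update --help first):

    # Raise the monitor timeout from 120s to 300s, keeping the
    # existing interval and on-fail policy.
    pcs resource update HAZELCAST_vmc0137 op monitor interval=10s timeout=300s on-fail=restart

The same form works for adjusting the start and stop operation timeouts
if needed.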
> 3) We are thinking of increasing the resource timeout
> to 300s and adding retry logic with the timeout command inside the
> dummy_monitor function of the RA, so that the pgrep command is killed
> if it does not respond within 30s and is retried on the next loop
> iteration. Would this solution help?
Yes, that should work.
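
Something like the following fragment, slotted into the monitor
function, would give that shape (using the coreutils timeout command;
the 30s per-attempt limit is the figure from the question, while three
attempts is an arbitrary choice here, and the total should stay well
under the 300s operation timeout):

    # Sketch: retry the process check up to 3 times, killing any
    # single attempt that takes longer than 30s, so one hung pgrep
    # cannot eat the whole operation timeout.
    attempts=3
    PID=""
    while [ "$attempts" -gt 0 ]; do
        PID=$(timeout 30 pgrep -f "Dcmg.component.name=SERVER")
        [ -n "$PID" ] && break
        attempts=$((attempts - 1))
        sleep 2
    done

    if [ -n "$PID" ]; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"

The same timeout wrapper can also go around the su/RCControl status
call if that is the command suspected of hanging.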
>
> Thanks & Regards,
> S Sathish S
--
Ken Gaillot <kgaillot at redhat.com>