<html><head></head><body><div style="color:#000; background-color:#fff; font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif;font-size:12px"><div id="yui_3_16_0_ym19_1_1473683432928_4051"><span>Hi Klaus,</span></div><div id="yui_3_16_0_ym19_1_1473683432928_4050"><span><br>Thanks for the reply.</span></div><div id="yui_3_16_0_ym19_1_1473683432928_4049"><span><br></span></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473683432928_4074"><span>I don't have any logs showing that 3561 was indeed the PID of apache, but I believe apache was killed successfully: when I logged on to the server, apache wasn't running.</span></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473683432928_4074"><span><br></span></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473683432928_4074"><span>I am running:</span></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473683432928_4074"><span>corosync-2.3.2-2<br></span></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473683432928_4074"><span>pacemaker-1.1.10-19<br></span></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473683432928_4074"><span><br></span></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473683432928_4074"><span>Thanks,</span></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473683432928_4074"><span>Alex</span></div> <div class="qtdSeparateBR"><br><br></div><div class="yahoo_quoted" style="display: block;"> <div style="font-family: HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif; font-size: 12px;"> <div style="font-family: HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif; font-size: 16px;"> <div dir="ltr"><font size="2" face="Arial"> On Monday, September 12, 2016 1:03 PM, Klaus Wenninger &lt;kwenning@redhat.com&gt; wrote:<br></font></div>  <br><br> <div class="y_msg_container">On 09/12/2016 12:55 PM, Alex wrote:<br clear="none">> Hi all,<br clear="none">><br clear="none">> I am having a problem with one of our pacemaker clusters that is<br clear="none">> running in an active-active configuration.<br 
clear="none">><br clear="none">> Sometimes the Website monitor times out, triggering an apache<br clear="none">> restart that fails. That increases the fail-count to INFINITY for<br clear="none">> the Website resource and makes it unmanaged. I have tried the<br clear="none">> following changes:<br clear="none">><br clear="none">> pcs property set start-failure-is-fatal=false<br clear="none">><br clear="none">> increasing the stop timeout on the Website resource:<br clear="none">> pcs resource op add Website stop interval=0s timeout=60s<br clear="none">><br clear="none">> Here is the resource configuration:<br clear="none">>  Resource: Website (class=ocf provider=heartbeat type=apache)<br clear="none">>   Attributes: configfile=/etc/httpd/conf/httpd.conf<br clear="none">> statusurl=<a shape="rect" href="http://localhost/server-status" target="_blank">http://localhost/server-status </a><br clear="none">>   Operations: start on-fail=restart interval=0s timeout=60s<br clear="none">> (Website-name-start-interval-0s-on-fail-restart-timeout-60s)<br clear="none">>               monitor on-fail=restart interval=1min timeout=40s<br clear="none">> (Website-name-monitor-interval-1min-on-fail-restart-timeout-40s)<br clear="none">>               stop interval=0s timeout=60s<br clear="none">> (Website-name-stop-interval-0s-timeout-60s)<br clear="none">><br clear="none">> Here is what I see in the logs when it fails:<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]:  warning:<br clear="none">> child_timeout_callback: Website_monitor_60000 process (PID 10352)<br clear="none">> timed out<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]:  warning:<br clear="none">> operation_finished: Website_monitor_60000:10352 - timed out after 40000ms<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:    error:<br clear="none">> process_lrm_event: LRM operation Website_monitor_60000 (32) Timed Out<br clear="none">> (timeout=40000ms)<br 
clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:  warning:<br clear="none">> update_failcount: Updating failcount for Website on pcs-wwwclu01-02<br clear="none">> after failed monitor: rc=1 (update=value++, time=1473543265)<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> fail-count-Website (1)<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_perform_update: Sent update 27: fail-count-Website=1<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> last-failure-Website (1473543265)<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning:<br clear="none">> unpack_rsc_op: Processing failed op monitor for Website:0 on<br clear="none">> pcs-wwwclu01-02: unknown error (1)<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:   notice: LogActions:<br clear="none">> Recover Website:0#011(Started pcs-wwwclu01-02)<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_perform_update: Sent update 30: last-failure-Website=1473543265<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning:<br clear="none">> unpack_rsc_op: Processing failed op monitor for Website:0 on<br clear="none">> pcs-wwwclu01-02: unknown error (1)<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:   notice: LogActions:<br clear="none">> Recover Website:0#011(Started pcs-wwwclu01-02)<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning:<br clear="none">> unpack_rsc_op: Processing failed op monitor for Website:0 on<br clear="none">> pcs-wwwclu01-02: unknown error (1)<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:   notice: LogActions:<br clear="none">> Recover Website:0#011(Started 
pcs-wwwclu01-02)<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:   notice: te_rsc_command:<br clear="none">> Initiating action 2: stop Website_stop_0 on pcs-wwwclu01-02 (local)<br clear="none">> Sep 10 17:34:25 pcs-wwwclu01-02 apache(Website)[10443]: INFO:<br clear="none">> Attempting graceful stop of apache PID 3561<br clear="none">> Sep 10 17:34:55 pcs-wwwclu01-02 apache(Website)[10443]: INFO: Killing<br clear="none">> apache PID 3561<br clear="none">> Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache<br clear="none">> still running (3561). Killing pid failed.<br clear="none">> Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache<br clear="none">> children were signalled (SIGTERM)<br clear="none">> Sep 10 17:35:06 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache<br clear="none">> children were signalled (SIGHUP)<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:   notice:<br clear="none">> process_lrm_event: LRM operation Website_stop_0 (call=34, rc=1,<br clear="none">> cib-update=3097, confirmed=true) unknown error<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:  warning: status_from_rc:<br clear="none">> Action 2 (Website_stop_0) on pcs-wwwclu01-02 failed (target: 0 vs. 
rc:<br clear="none">> 1): Error<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:  warning:<br clear="none">> update_failcount: Updating failcount for Website on pcs-wwwclu01-02<br clear="none">> after failed stop: rc=1 (update=INFINITY, time=1473543307)<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> fail-count-Website (INFINITY)<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_perform_update: Sent update 32: fail-count-Website=INFINITY<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> last-failure-Website (1473543307)<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_perform_update: Sent update 34: last-failure-Website=1473543307<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> fail-count-Website (INFINITY)<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:<br clear="none">> unpack_rsc_op: Processing failed op stop for Website:0 on<br clear="none">> pcs-wwwclu01-02: unknown error (1)<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_perform_update: Sent update 36: fail-count-Website=INFINITY<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> last-failure-Website (1473543307)<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:<br clear="none">> attrd_perform_update: Sent update 38: last-failure-Website=1473543307<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:<br clear="none">> 
unpack_rsc_op: Processing failed op stop for Website:0 on<br clear="none">> pcs-wwwclu01-02: unknown error (1)<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:<br clear="none">> common_apply_stickiness: Forcing Website-clone away from<br clear="none">> pcs-wwwclu01-02 after 1000000 failures (max=1000000)<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:<br clear="none">> unpack_rsc_op: Processing failed op stop for Website:0 on<br clear="none">> pcs-wwwclu01-02: unknown error (1)<br clear="none">> Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:<br clear="none">> common_apply_stickiness: Forcing Website-clone away from<br clear="none">> pcs-wwwclu01-02 after 1000000 failures (max=1000000)<br clear="none">><br clear="none">> I don't see pacemaker waiting the full 60 seconds for apache to<br clear="none">> stop.<br clear="none"><br clear="none">.../heartbeat/apache:<br clear="none"><br clear="none">graceful_stop()<br clear="none">{<br clear="none">...<br clear="none">        # Try graceful stop for half timeout period if timeout period is present<br clear="none">        if [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then<br clear="none">                tries=$((($OCF_RESKEY_CRM_meta_timeout/1000) / 2))<br clear="none">        fi<br clear="none"><br clear="none">So the 30 seconds in the log are to be expected: with a stop<br clear="none">timeout of 60s (60000 ms), the graceful stop gets half of it.<br clear="none">Why apache doesn't terminate within these 30 seconds, and<br clear="none">why escalating to SIGTERM doesn't help either, is a<br clear="none">different question ...<br clear="none"><br clear="none">Do you have logs showing whether 3561 was really the PID<br clear="none">of a running apache at the time the stop was attempted?<br clear="none">I don't see the RA (at least the version I have on my<br clear="none">test cluster) checking anywhere for the running binary<br clear="none">or the like.<div 
class="yqt3506557955" id="yqtfd94208"><br clear="none"> <br clear="none">> Has anyone encountered something like this before? Or am I missing<br clear="none">> something in the configuration?<br clear="none">><br clear="none">> Thank you,<br clear="none">> Alex</div><br clear="none">><br clear="none">><br clear="none">><br clear="none">><br clear="none">> _______________________________________________<br clear="none">> Users mailing list: <a shape="rect" ymailto="mailto:Users@clusterlabs.org" href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a><br clear="none">> <a shape="rect" href="http://clusterlabs.org/mailman/listinfo/users" target="_blank">http://clusterlabs.org/mailman/listinfo/users</a><br clear="none">><br clear="none">> Project Home: <a shape="rect" href="http://www.clusterlabs.org/" target="_blank">http://www.clusterlabs.org</a><br clear="none">> Getting started: <a shape="rect" href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br clear="none">> Bugs: <a shape="rect" href="http://bugs.clusterlabs.org/" target="_blank">http://bugs.clusterlabs.org</a><br clear="none"><div class="yqt3506557955" id="yqtfd43948"><br clear="none"></div><br><br></div>  </div> </div>  </div></div></body></html>