<html><head></head><body><div style="color:#000; background-color:#fff; font-family:HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif;font-size:12px"><div id="yui_3_16_0_ym19_1_1473687074496_3082"><span id="yui_3_16_0_ym19_1_1473687074496_3081">From the error_log at the same time:</span></div><div id="yui_3_16_0_ym19_1_1473687074496_3083"><span><br></span></div><div><span></span></div><div id="yui_3_16_0_ym19_1_1473687074496_3078">[Sat Sep 10 17:35:05 2016] [notice] caught SIGWINCH, shutting down gracefully</div><div dir="ltr" id="yui_3_16_0_ym19_1_1473687074496_3079"><br id="yui_3_16_0_ym19_1_1473687074496_3080"></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473687074496_3079">As for the failure, I have traced it to the MaxClients limit being reached; I can adjust that.</div><div dir="ltr" id="yui_3_16_0_ym19_1_1473687074496_3079"><br></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473687074496_3079">However, I am still concerned that the restart didn't work properly and that the resource became unmanaged.</div><div dir="ltr" id="yui_3_16_0_ym19_1_1473687074496_3079"><br></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473687074496_3079">Is there anything I can do to prevent that?</div><div dir="ltr" id="yui_3_16_0_ym19_1_1473687074496_3079"><br></div><div dir="ltr" id="yui_3_16_0_ym19_1_1473687074496_3079">Thank you,</div><div dir="ltr" id="yui_3_16_0_ym19_1_1473687074496_3079">Alex</div> <div class="qtdSeparateBR"><br><br></div><div class="yahoo_quoted" style="display: block;"> <div style="font-family: HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif; font-size: 12px;"> <div style="font-family: HelveticaNeue, Helvetica Neue, Helvetica, Arial, Lucida Grande, sans-serif; font-size: 16px;"> <div dir="ltr"><font size="2" face="Arial"> On Monday, September 12, 2016 2:24 PM, Klaus Wenninger <kwenning@redhat.com> wrote:<br></font></div> <br><br> <div class="y_msg_container"><br clear="none">On 09/12/2016 03:00 PM, Alex wrote:<br 
clear="none">> Hi Klaus,<br clear="none">><br clear="none">> Thanks for the reply.<br clear="none">><br clear="none">> I don't have any logs to indicate that was indeed the PID of apache, but<br clear="none">> I believe apache was killed successfully: when I logged on to the server,<br clear="none">> apache wasn't running.<br clear="none"><br clear="none">The reason I asked was rather whether it might already have been<br clear="none">dead and some other process had taken its PID.<br clear="none">That would be a reason both for the monitor to fail and<br clear="none">for the more graceful ways of stopping to fail.<br clear="none">><br clear="none">> I am running:<br clear="none">> corosync-2.3.2-2<br clear="none">> pacemaker-1.1.10-19<br clear="none">><br clear="none">> Thanks,<br clear="none">> Alex<br clear="none">><br clear="none">><br clear="none">> On Monday, September 12, 2016 1:03 PM, Klaus Wenninger<br clear="none">> <<a shape="rect" ymailto="mailto:kwenning@redhat.com" href="mailto:kwenning@redhat.com">kwenning@redhat.com</a>> wrote:<br clear="none">><br clear="none">><br clear="none">> On 09/12/2016 12:55 PM, Alex wrote:<br clear="none">> > Hi all,<br clear="none">> ><br clear="none">> > I am having a problem with one of our pacemaker clusters that is<br clear="none">> > running in an active-active configuration.<br clear="none">> ><br clear="none">> > Sometimes the Website monitor will time out, triggering an apache<br clear="none">> > restart that fails. That will increase the fail-count to INFINITY for<br clear="none">> > the Website resource and make it unmanaged. 
I have tried the<br clear="none">> > following changes:<br clear="none">> ><br clear="none">> > pcs property set start-failure-is-fatal=false<br clear="none">> ><br clear="none">> > and increasing the stop timeout on the Website resource:<br clear="none">> > pcs resource op add Website stop interval=0s timeout=60s<br clear="none">> ><br clear="none">> > Here is the resource configuration:<br clear="none">> > Resource: Website (class=ocf provider=heartbeat type=apache)<br clear="none">> > Attributes: configfile=/etc/httpd/conf/httpd.conf<br clear="none">> > statusurl=<a shape="rect" href="http://localhost/server-status" target="_blank">http://localhost/server-status</a><br clear="none">> > Operations: start on-fail=restart interval=0s timeout=60s<br clear="none">> > (Website-name-start-interval-0s-on-fail-restart-timeout-60s)<br clear="none">> > monitor on-fail=restart interval=1min timeout=40s<br clear="none">> > (Website-name-monitor-interval-1min-on-fail-restart-timeout-40s)<br clear="none">> > stop interval=0s timeout=60s<br clear="none">> > (Website-name-stop-interval-0s-timeout-60s)<br clear="none">> ><br clear="none">> > Here is what I see in the logs when it fails:<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]: warning:<br clear="none">> > child_timeout_callback: Website_monitor_60000 process (PID 10352)<br clear="none">> > timed out<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]: warning:<br clear="none">> > operation_finished: Website_monitor_60000:10352 - timed out after<br clear="none">> > 40000ms<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]: error:<br clear="none">> > process_lrm_event: LRM operation Website_monitor_60000 (32) Timed Out<br clear="none">> > (timeout=40000ms)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]: warning:<br clear="none">> > update_failcount: Updating failcount for Website on pcs-wwwclu01-02<br clear="none">> > after failed monitor: rc=1 (update=value++, 
time=1473543265)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> > fail-count-Website (1)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_perform_update: Sent update 27: fail-count-Website=1<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> > last-failure-Website (1473543265)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: warning:<br clear="none">> > unpack_rsc_op: Processing failed op monitor for Website:0 on<br clear="none">> > pcs-wwwclu01-02: unknown error (1)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: notice: LogActions:<br clear="none">> > Recover Website:0#011(Started pcs-wwwclu01-02)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_perform_update: Sent update 30: last-failure-Website=1473543265<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: warning:<br clear="none">> > unpack_rsc_op: Processing failed op monitor for Website:0 on<br clear="none">> > pcs-wwwclu01-02: unknown error (1)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: notice: LogActions:<br clear="none">> > Recover Website:0#011(Started pcs-wwwclu01-02)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: warning:<br clear="none">> > unpack_rsc_op: Processing failed op monitor for Website:0 on<br clear="none">> > pcs-wwwclu01-02: unknown error (1)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]: notice: LogActions:<br clear="none">> > Recover Website:0#011(Started pcs-wwwclu01-02)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]: notice: te_rsc_command:<br clear="none">> > Initiating action 2: stop Website_stop_0 on pcs-wwwclu01-02 
(local)<br clear="none">> > Sep 10 17:34:25 pcs-wwwclu01-02 apache(Website)[10443]: INFO:<br clear="none">> > Attempting graceful stop of apache PID 3561<br clear="none">> > Sep 10 17:34:55 pcs-wwwclu01-02 apache(Website)[10443]: INFO: Killing<br clear="none">> > apache PID 3561<br clear="none">> > Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache<br clear="none">> > still running (3561). Killing pid failed.<br clear="none">> > Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache<br clear="none">> > children were signalled (SIGTERM)<br clear="none">> > Sep 10 17:35:06 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache<br clear="none">> > children were signalled (SIGHUP)<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]: notice:<br clear="none">> > process_lrm_event: LRM operation Website_stop_0 (call=34, rc=1,<br clear="none">> > cib-update=3097, confirmed=true) unknown error<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]: warning: status_from_rc:<br clear="none">> > Action 2 (Website_stop_0) on pcs-wwwclu01-02 failed (target: 0 vs. 
rc:<br clear="none">> > 1): Error<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]: warning:<br clear="none">> > update_failcount: Updating failcount for Website on pcs-wwwclu01-02<br clear="none">> > after failed stop: rc=1 (update=INFINITY, time=1473543307)<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> > fail-count-Website (INFINITY)<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_perform_update: Sent update 32: fail-count-Website=INFINITY<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> > last-failure-Website (1473543307)<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_perform_update: Sent update 34: last-failure-Website=1473543307<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> > fail-count-Website (INFINITY)<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]: warning:<br clear="none">> > unpack_rsc_op: Processing failed op stop for Website:0 on<br clear="none">> > pcs-wwwclu01-02: unknown error (1)<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_perform_update: Sent update 36: fail-count-Website=INFINITY<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_trigger_update: Sending flush op to all hosts for:<br clear="none">> > last-failure-Website (1473543307)<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]: notice:<br clear="none">> > attrd_perform_update: Sent update 38: last-failure-Website=1473543307<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 
pengine[2270]: warning:<br clear="none">> > unpack_rsc_op: Processing failed op stop for Website:0 on<br clear="none">> > pcs-wwwclu01-02: unknown error (1)<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]: warning:<br clear="none">> > common_apply_stickiness: Forcing Website-clone away from<br clear="none">> > pcs-wwwclu01-02 after 1000000 failures (max=1000000)<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]: warning:<br clear="none">> > unpack_rsc_op: Processing failed op stop for Website:0 on<br clear="none">> > pcs-wwwclu01-02: unknown error (1)<br clear="none">> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]: warning:<br clear="none">> > common_apply_stickiness: Forcing Website-clone away from<br clear="none">> > pcs-wwwclu01-02 after 1000000 failures (max=1000000)<br clear="none">> ><br clear="none">> > I don't see pacemaker waiting the full 60 seconds for apache to<br clear="none">> > stop.<br clear="none">><br clear="none">> .../heartbeat/apache:<br clear="none">><br clear="none">> graceful_stop()<br clear="none">><br clear="none">> {<br clear="none">><br clear="none">> ...<br clear="none">><br clear="none">> # Try graceful stop for half timeout period if timeout period<br clear="none">> is present<br clear="none">><br clear="none">> if [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then<br clear="none">><br clear="none">> tries=$((($OCF_RESKEY_CRM_meta_timeout/1000) / 2))<br clear="none">><br clear="none">> fi<br clear="none">><br clear="none">> so the 30 seconds from the log are to be expected.<br clear="none">> Why it doesn't terminate within these 30 seconds and<br clear="none">> why escalation to SIGTERM doesn't help either is<br clear="none">> another matter ...<br clear="none">><br clear="none">> Do you have logs showing whether, at the time the stop<br clear="none">> was attempted, 3561 was really the PID of a running apache?<br clear="none">> I don't see the RA (at least the version I have on my<br clear="none">> 
test-cluster) anywhere checking for the running binary<br clear="none">> or alike.<br clear="none">><br clear="none">><br clear="none">> > Has anyone encountered something like this before? Or am I missing<br clear="none">> > something in the configuration?<br clear="none">> ><br clear="none">> > Thank you,<br clear="none">> > Alex<br clear="none">><br clear="none">> ><br clear="none">> ><br clear="none">> ><br clear="none">> ><br clear="none">> > _______________________________________________<br clear="none">> > Users mailing list: <a shape="rect" ymailto="mailto:Users@clusterlabs.org" href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a> <mailto:<a shape="rect" ymailto="mailto:Users@clusterlabs.org" href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a>><br clear="none">> > <a shape="rect" href="http://clusterlabs.org/mailman/listinfo/users" target="_blank">http://clusterlabs.org/mailman/listinfo/users</a><br clear="none">> ><br clear="none">> > Project Home: <a shape="rect" href="http://www.clusterlabs.org/" target="_blank">http://www.clusterlabs.org </a><<a shape="rect" href="http://www.clusterlabs.org/" target="_blank">http://www.clusterlabs.org/</a>><br clear="none">> > Getting started: <a shape="rect" href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br clear="none">> > Bugs: <a shape="rect" href="http://bugs.clusterlabs.org/" target="_blank">http://bugs.clusterlabs.org </a><<a shape="rect" href="http://bugs.clusterlabs.org/" target="_blank">http://bugs.clusterlabs.org/</a>><br clear="none">><br clear="none"><div class="yqt5397986584" id="yqtfd39721"></div><br><br></div> </div> </div> </div></div></body></html>