[ClusterLabs] Pacemaker Active-Active setup monitor problem

Mon Sep 12 06:55:14 EDT 2016

Hi all,
I am having a problem with one of our pacemaker clusters that is running in an active-active configuration.
Sometimes the Website monitor times out, triggering an apache restart that fails. That raises the fail-count for the Website resource to INFINITY and leaves it unmanaged. I have tried the following changes:
pcs property set start-failure-is-fatal=false

Increasing the stop timeout on the Website resource:
pcs resource op add Website stop interval=0s timeout=60s

Here is the resource configuration:
 Resource: Website (class=ocf provider=heartbeat type=apache)
  Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost/server-status
  Operations: start on-fail=restart interval=0s timeout=60s (Website-name-start-interval-0s-on-fail-restart-timeout-60s)
              monitor on-fail=restart interval=1min timeout=40s (Website-name-monitor-interval-1min-on-fail-restart-timeout-40s)
              stop interval=0s timeout=60s (Website-name-stop-interval-0s-timeout-60s)
Here is what I see in the logs when it fails:
Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]:  warning: child_timeout_callback: Website_monitor_60000 process (PID 10352) timed out
Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]:  warning: operation_finished: Website_monitor_60000:10352 - timed out after 40000ms
Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:    error: process_lrm_event: LRM operation Website_monitor_60000 (32) Timed Out (timeout=40000ms)
Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:  warning: update_failcount: Updating failcount for Website on pcs-wwwclu01-02 after failed monitor: rc=1 (update=value++, time=1473543265)
Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-Website (1)
Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_perform_update: Sent update 27: fail-count-Website=1
Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-Website (1473543265)
Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning: unpack_rsc_op: Processing failed op monitor for Website:0 on pcs-wwwclu01-02: unknown error (1)
Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:   notice: LogActions: Recover Website:0#011(Started pcs-wwwclu01-02)
Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_perform_update: Sent update 30: last-failure-Website=1473543265
Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning: unpack_rsc_op: Processing failed op monitor for Website:0 on pcs-wwwclu01-02: unknown error (1)
Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:   notice: LogActions: Recover Website:0#011(Started pcs-wwwclu01-02)
Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning: unpack_rsc_op: Processing failed op monitor for Website:0 on pcs-wwwclu01-02: unknown error (1)
Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:   notice: LogActions: Recover Website:0#011(Started pcs-wwwclu01-02)
Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:   notice: te_rsc_command: Initiating action 2: stop Website_stop_0 on pcs-wwwclu01-02 (local)
Sep 10 17:34:25 pcs-wwwclu01-02 apache(Website)[10443]: INFO: Attempting graceful stop of apache PID 3561
Sep 10 17:34:55 pcs-wwwclu01-02 apache(Website)[10443]: INFO: Killing apache PID 3561
Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache still running (3561). Killing pid failed.
Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache children were signalled (SIGTERM)
Sep 10 17:35:06 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache children were signalled (SIGHUP)
Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:   notice: process_lrm_event: LRM operation Website_stop_0 (call=34, rc=1, cib-update=3097, confirmed=true) unknown error
Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:  warning: status_from_rc: Action 2 (Website_stop_0) on pcs-wwwclu01-02 failed (target: 0 vs. rc: 1): Error
Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:  warning: update_failcount: Updating failcount for Website on pcs-wwwclu01-02 after failed stop: rc=1 (update=INFINITY, time=1473543307)
Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-Website (INFINITY)
Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_perform_update: Sent update 32: fail-count-Website=INFINITY
Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-Website (1473543307)
Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_perform_update: Sent update 34: last-failure-Website=1473543307
Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-Website (INFINITY)
Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning: unpack_rsc_op: Processing failed op stop for Website:0 on pcs-wwwclu01-02: unknown error (1)
Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_perform_update: Sent update 36: fail-count-Website=INFINITY
Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-Website (1473543307)
Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice: attrd_perform_update: Sent update 38: last-failure-Website=1473543307
Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning: unpack_rsc_op: Processing failed op stop for Website:0 on pcs-wwwclu01-02: unknown error (1)
Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning: common_apply_stickiness: Forcing Website-clone away from pcs-wwwclu01-02 after 1000000 failures (max=1000000)
Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning: unpack_rsc_op: Processing failed op stop for Website:0 on pcs-wwwclu01-02: unknown error (1)
Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning: common_apply_stickiness: Forcing Website-clone away from pcs-wwwclu01-02 after 1000000 failures (max=1000000)
I don't see pacemaker waiting the full 60 seconds for apache to stop.
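To put numbers on that, here is a rough sketch that works out the elapsed time between the stop being initiated and the stop being reported as failed, using the timestamps from the log above. The helper names (to_seconds, elapsed) are made up for illustration, and it assumes both timestamps fall on the same day.

```shell
# Convert an HH:MM:SS syslog timestamp to seconds since midnight.
# (to_seconds and elapsed are hypothetical helper names, not part of pcs.)
to_seconds() {
  old_ifs=$IFS
  IFS=:
  set -- $1          # split HH:MM:SS into $1 $2 $3
  IFS=$old_ifs
  # Strip one leading zero so values like "08" are not parsed as octal.
  echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Seconds between two timestamps on the same day: elapsed START END
elapsed() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

stop_started=17:34:25   # te_rsc_command: Initiating action 2: stop Website_stop_0
kill_sent=17:34:55      # apache(Website): Killing apache PID 3561
stop_failed=17:35:07    # process_lrm_event: Website_stop_0 ... rc=1

echo "graceful stop phase: $(elapsed "$stop_started" "$kill_sent")s"
echo "total stop duration: $(elapsed "$stop_started" "$stop_failed")s"
```

By that arithmetic the stop operation returned an error after 42 seconds, of which the first 30 were the graceful stop attempt, so it ended before the configured 60s timeout was reached.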
Has anyone encountered something like this before? Or am I missing something in the configuration?
Thank you,
Alex
