[ClusterLabs] Pacemaker Active-Active setup monitor problem

Mon Sep 12 11:43:32 UTC 2016

On 09/12/2016 12:55 PM, Alex wrote:
> Hi all,
>
> I am having a problem with one of our pacemaker clusters that is
> running in an active-active configuration.
>
> Sometimes the Website monitor will time out, triggering an apache
> restart that fails. That increases the fail-count to INFINITY for
> the Website resource and makes it unmanaged. I have tried the
> following changes:
>
> pcs property set start-failure-is-fatal=false
>
> increasing the stop timeout on the Website resource:
> pcs resource op add Website stop interval=0s timeout=60s
>
> Here is the resource configuration:
>  Resource: Website (class=ocf provider=heartbeat type=apache)
>   Attributes: configfile=/etc/httpd/conf/httpd.conf
> statusurl=http://localhost/server-status 
>   Operations: start on-fail=restart interval=0s timeout=60s
> (Website-name-start-interval-0s-on-fail-restart-timeout-60s)
>               monitor on-fail=restart interval=1min timeout=40s
> (Website-name-monitor-interval-1min-on-fail-restart-timeout-40s)
>               stop interval=0s timeout=60s
> (Website-name-stop-interval-0s-timeout-60s)
>
> Here is what I see in the logs when it fails:
> Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]:  warning:
> child_timeout_callback: Website_monitor_60000 process (PID 10352)
> timed out
> Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]:  warning:
> operation_finished: Website_monitor_60000:10352 - timed out after 40000ms
> Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:    error:
> process_lrm_event: LRM operation Website_monitor_60000 (32) Timed Out
> (timeout=40000ms)
> Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:  warning:
> update_failcount: Updating failcount for Website on pcs-wwwclu01-02
> after failed monitor: rc=1 (update=value++, time=1473543265)
> Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-Website (1)
> Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_perform_update: Sent update 27: fail-count-Website=1
> Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> last-failure-Website (1473543265)
> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning:
> unpack_rsc_op: Processing failed op monitor for Website:0 on
> pcs-wwwclu01-02: unknown error (1)
> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:   notice: LogActions:
> Recover Website:0#011(Started pcs-wwwclu01-02)
> Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_perform_update: Sent update 30: last-failure-Website=1473543265
> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning:
> unpack_rsc_op: Processing failed op monitor for Website:0 on
> pcs-wwwclu01-02: unknown error (1)
> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:   notice: LogActions:
> Recover Website:0#011(Started pcs-wwwclu01-02)
> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning:
> unpack_rsc_op: Processing failed op monitor for Website:0 on
> pcs-wwwclu01-02: unknown error (1)
> Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:   notice: LogActions:
> Recover Website:0#011(Started pcs-wwwclu01-02)
> Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:   notice: te_rsc_command:
> Initiating action 2: stop Website_stop_0 on pcs-wwwclu01-02 (local)
> Sep 10 17:34:25 pcs-wwwclu01-02 apache(Website)[10443]: INFO:
> Attempting graceful stop of apache PID 3561
> Sep 10 17:34:55 pcs-wwwclu01-02 apache(Website)[10443]: INFO: Killing
> apache PID 3561
> Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache
> still running (3561). Killing pid failed.
> Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache
> children were signalled (SIGTERM)
> Sep 10 17:35:06 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache
> children were signalled (SIGHUP)
> Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:   notice:
> process_lrm_event: LRM operation Website_stop_0 (call=34, rc=1,
> cib-update=3097, confirmed=true) unknown error
> Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:  warning: status_from_rc:
> Action 2 (Website_stop_0) on pcs-wwwclu01-02 failed (target: 0 vs. rc:
> 1): Error
> Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:  warning:
> update_failcount: Updating failcount for Website on pcs-wwwclu01-02
> after failed stop: rc=1 (update=INFINITY, time=1473543307)
> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-Website (INFINITY)
> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_perform_update: Sent update 32: fail-count-Website=INFINITY
> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> last-failure-Website (1473543307)
> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_perform_update: Sent update 34: last-failure-Website=1473543307
> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> fail-count-Website (INFINITY)
> Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:
> unpack_rsc_op: Processing failed op stop for Website:0 on
> pcs-wwwclu01-02: unknown error (1)
> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_perform_update: Sent update 36: fail-count-Website=INFINITY
> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> last-failure-Website (1473543307)
> Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:   notice:
> attrd_perform_update: Sent update 38: last-failure-Website=1473543307
> Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:
> unpack_rsc_op: Processing failed op stop for Website:0 on
> pcs-wwwclu01-02: unknown error (1)
> Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:
> common_apply_stickiness: Forcing Website-clone away from
> pcs-wwwclu01-02 after 1000000 failures (max=1000000)
> Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:
> unpack_rsc_op: Processing failed op stop for Website:0 on
> pcs-wwwclu01-02: unknown error (1)
> Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:
> common_apply_stickiness: Forcing Website-clone away from
> pcs-wwwclu01-02 after 1000000 failures (max=1000000)
>
> I don't see pacemaker waiting the full 60 seconds for apache to
> stop.

.../heartbeat/apache:

graceful_stop()
{
...
        # Try graceful stop for half timeout period if timeout period is present
        if [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
                tries=$((($OCF_RESKEY_CRM_meta_timeout/1000) / 2))
        fi

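so the 30 seconds in the log are to be expected: with a 60s stop
timeout, the RA tries a graceful stop for half of it (17:34:25 to
17:34:55) before killing the process.
Why apache doesn't terminate within those 30 seconds, and
why the escalation to SIGTERM doesn't help either, is
another story ...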
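
To make the arithmetic concrete, here is a minimal sketch of the same
calculation outside the RA. The helper name half_timeout_tries is
made up for illustration and is not part of the resource agent; the
argument stands in for OCF_RESKEY_CRM_meta_timeout, which is given in
milliseconds.

```shell
#!/bin/sh
# Sketch only: mirrors the "half the timeout" arithmetic quoted from
# graceful_stop(). half_timeout_tries is a hypothetical helper name.
half_timeout_tries() {
    # $1: operation timeout in milliseconds
    echo $(( ($1 / 1000) / 2 ))
}

# The stop operation above is configured with timeout=60s (60000 ms).
half_timeout_tries 60000
```

With the configured 60s stop timeout this gives 30, which matches the
gap between "Attempting graceful stop" and "Killing apache PID" in the
log.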

Do you have logs showing whether 3561 was really the PID of a
running apache at the time the stop was attempted?
I don't see the RA (at least the version I have on my
test cluster) checking anywhere for the running binary
or the like.
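
If it happens again, something along these lines could be run by hand
on the node to see what the PID actually belongs to. This is only a
rough sketch, not something the RA does: pid_command and pid_is_httpd
are hypothetical helper names, and the binary names httpd and apache2
are assumptions based on the configfile path in your resource
configuration and on common distribution naming.

```shell
#!/bin/sh
# Sketch only: report which command a PID belongs to.
# pid_command and pid_is_httpd are hypothetical helpers, not RA code.
pid_command() {
    # Prefer /proc on Linux, fall back to POSIX ps.
    if [ -r "/proc/$1/comm" ]; then
        cat "/proc/$1/comm"
    else
        ps -p "$1" -o comm= 2>/dev/null
    fi
}

pid_is_httpd() {
    # Assumed binary names: httpd (RHEL/CentOS) or apache2 (Debian-like).
    case "$(pid_command "$1")" in
        httpd*|apache2*) return 0 ;;
        *) return 1 ;;
    esac
}

# 3561 is the PID from the log above; replace it with the current one.
if pid_is_httpd 3561; then
    echo "PID 3561 looks like an apache process"
else
    echo "PID 3561 is not running or is not apache"
fi
```

A stale PID file pointing at an unrelated or vanished process would
show up here as the second message.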

> Has anyone encountered something like this before? Or am I missing
> something in the configuration?
>
> Thank you,
> Alex