[ClusterLabs] Pacemaker Active-Active setup monitor problem

Alex kan3x at yahoo.com
Mon Sep 12 09:45:21 EDT 2016


From the error_log at the same time:
[Sat Sep 10 17:35:05 2016] [notice] caught SIGWINCH, shutting down gracefully
As for the failure, I have traced it to the MaxClients limit being reached; I can adjust that.
However, I am still concerned that the restart did not work properly and the resource became unmanaged.
Is there anything I can do to prevent that?
Thank you,
Alex
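
If the monitor timeouts really do come down to Apache hitting its worker limit, raising the prefork limits and letting monitor fail-counts expire may already help, though a failed stop will still normally need a cleanup (or working fencing) before the cluster touches the resource again. A rough sketch, with illustrative values only (not taken from this thread):

  # /etc/httpd/conf/httpd.conf -- raise the worker ceiling
  # (prefork MPM; MaxClients is the 2.2-era name, 2.4 calls it MaxRequestWorkers)
  <IfModule prefork.c>
      ServerLimit  512
      MaxClients   512
  </IfModule>

  # Let monitor fail-counts expire instead of accumulating forever
  pcs resource update Website meta failure-timeout=300s

  # After a failed stop, clear the failure so the clone instance is managed again
  pcs resource cleanup Website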

On Monday, September 12, 2016 2:24 PM, Klaus Wenninger <kwenning at redhat.com> wrote:

On 09/12/2016 03:00 PM, Alex wrote:
> Hi Klaus,
>
> Thanks for the reply.
>
> I don't have any logs to indicate that was indeed the PID of apache,
> but I believe apache was killed successfully, since when I logged on
> to the server apache wasn't running.

The reason I asked was rather whether it might already have been dead
before and some other process had taken its PID. That would be a reason
both for the monitor to fail and for the more graceful ways of stopping
to fail.
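
Next time this happens, a quick check on the node (standard Linux tooling, run before any cleanup) would settle whether 3561 still belonged to apache:

  # What, if anything, is running under that PID right now?
  ps -p 3561 -o pid,ppid,comm,args

  # Or straight from /proc
  cat /proc/3561/comm 2>/dev/null || echo "PID 3561 not in use"

  # Compare against what httpd last wrote to its pid file
  # (path depends on the PidFile setting in httpd.conf)
  cat /var/run/httpd/httpd.pid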
>
> I am running:
> corosync-2.3.2-2
> pacemaker-1.1.10-19
>
> Thanks,
> Alex
>
>
> On Monday, September 12, 2016 1:03 PM, Klaus Wenninger
> <kwenning at redhat.com> wrote:
>
>
> On 09/12/2016 12:55 PM, Alex wrote:
> > Hi all,
> >
> > I am having a problem with one of our pacemaker clusters that is
> > running in an active-active configuration.
> >
> > Sometimes the Website monitor will time out, triggering an apache
> > restart that fails. That increases the fail-count to INFINITY for
> > the Website resource and makes it unmanaged. I have tried the
> > following changes:
> >
> > pcs property set start-failure-is-fatal=false
> >
> > increasing the stop operation timeout on the Website resource:
> > pcs resource op add Website stop interval=0s timeout=60s
> >
> > Here is the resource configuration:
> >  Resource: Website (class=ocf provider=heartbeat type=apache)
> >  Attributes: configfile=/etc/httpd/conf/httpd.conf
> > statusurl=http://localhost/server-status
> >  Operations: start on-fail=restart interval=0s timeout=60s
> > (Website-name-start-interval-0s-on-fail-restart-timeout-60s)
> >              monitor on-fail=restart interval=1min timeout=40s
> > (Website-name-monitor-interval-1min-on-fail-restart-timeout-40s)
> >              stop interval=0s timeout=60s
> > (Website-name-stop-interval-0s-timeout-60s)
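
For reference, a roughly equivalent pcs 0.9-style create command for such a clone, assuming the stock ocf:heartbeat:apache agent:

  pcs resource create Website ocf:heartbeat:apache \
      configfile=/etc/httpd/conf/httpd.conf \
      statusurl="http://localhost/server-status" \
      op start   interval=0s   timeout=60s on-fail=restart \
      op monitor interval=1min timeout=40s on-fail=restart \
      op stop    interval=0s   timeout=60s \
      --clone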
> >
> > Here is what I see in the logs when it fails:
> > Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]:  warning:
> > child_timeout_callback: Website_monitor_60000 process (PID 10352)
> > timed out
> > Sep 10 17:34:25 pcs-wwwclu01-02 lrmd[2268]:  warning:
> > operation_finished: Website_monitor_60000:10352 - timed out after
> 40000ms
> > Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:    error:
> > process_lrm_event: LRM operation Website_monitor_60000 (32) Timed Out
> > (timeout=40000ms)
> > Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:  warning:
> > update_failcount: Updating failcount for Website on pcs-wwwclu01-02
> > after failed monitor: rc=1 (update=value++, time=1473543265)
> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > fail-count-Website (1)
> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_perform_update: Sent update 27: fail-count-Website=1
> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > last-failure-Website (1473543265)
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning:
> > unpack_rsc_op: Processing failed op monitor for Website:0 on
> > pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  notice: LogActions:
> > Recover Website:0#011(Started pcs-wwwclu01-02)
> > Sep 10 17:34:25 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_perform_update: Sent update 30: last-failure-Website=1473543265
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning:
> > unpack_rsc_op: Processing failed op monitor for Website:0 on
> > pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  notice: LogActions:
> > Recover Website:0#011(Started pcs-wwwclu01-02)
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  warning:
> > unpack_rsc_op: Processing failed op monitor for Website:0 on
> > pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:34:25 pcs-wwwclu01-02 pengine[2270]:  notice: LogActions:
> > Recover Website:0#011(Started pcs-wwwclu01-02)
> > Sep 10 17:34:25 pcs-wwwclu01-02 crmd[2271]:  notice: te_rsc_command:
> > Initiating action 2: stop Website_stop_0 on pcs-wwwclu01-02 (local)
> > Sep 10 17:34:25 pcs-wwwclu01-02 apache(Website)[10443]: INFO:
> > Attempting graceful stop of apache PID 3561
> > Sep 10 17:34:55 pcs-wwwclu01-02 apache(Website)[10443]: INFO: Killing
> > apache PID 3561
> > Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache
> > still running (3561). Killing pid failed.
> > Sep 10 17:35:04 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache
> > children were signalled (SIGTERM)
> > Sep 10 17:35:06 pcs-wwwclu01-02 apache(Website)[10443]: INFO: apache
> > children were signalled (SIGHUP)
> > Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:  notice:
> > process_lrm_event: LRM operation Website_stop_0 (call=34, rc=1,
> > cib-update=3097, confirmed=true) unknown error
> > Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:  warning: status_from_rc:
> > Action 2 (Website_stop_0) on pcs-wwwclu01-02 failed (target: 0 vs. rc:
> > 1): Error
> > Sep 10 17:35:07 pcs-wwwclu01-02 crmd[2271]:  warning:
> > update_failcount: Updating failcount for Website on pcs-wwwclu01-02
> > after failed stop: rc=1 (update=INFINITY, time=1473543307)
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > fail-count-Website (INFINITY)
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_perform_update: Sent update 32: fail-count-Website=INFINITY
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > last-failure-Website (1473543307)
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_perform_update: Sent update 34: last-failure-Website=1473543307
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > fail-count-Website (INFINITY)
> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:
> > unpack_rsc_op: Processing failed op stop for Website:0 on
> > pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_perform_update: Sent update 36: fail-count-Website=INFINITY
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_trigger_update: Sending flush op to all hosts for:
> > last-failure-Website (1473543307)
> > Sep 10 17:35:07 pcs-wwwclu01-02 attrd[2269]:  notice:
> > attrd_perform_update: Sent update 38: last-failure-Website=1473543307
> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:
> > unpack_rsc_op: Processing failed op stop for Website:0 on
> > pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:
> > common_apply_stickiness: Forcing Website-clone away from
> > pcs-wwwclu01-02 after 1000000 failures (max=1000000)
> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:
> > unpack_rsc_op: Processing failed op stop for Website:0 on
> > pcs-wwwclu01-02: unknown error (1)
> > Sep 10 17:35:07 pcs-wwwclu01-02 pengine[2270]:  warning:
> > common_apply_stickiness: Forcing Website-clone away from
> > pcs-wwwclu01-02 after 1000000 failures (max=1000000)
> >
> > I don't see pacemaker waiting the full 60 seconds for apache to
> > stop.
>
> .../heartbeat/apache:
>
> graceful_stop()
> {
> ...
>        # Try graceful stop for half timeout period if timeout period is present
>        if [ -n "$OCF_RESKEY_CRM_meta_timeout" ]; then
>                tries=$((($OCF_RESKEY_CRM_meta_timeout/1000) / 2))
>        fi
>
> so the 30 seconds from the log are to be expected.
> Why it doesn't terminate within those 30 seconds, and
> why the escalation to SIGTERM doesn't help either, is
> another matter ...
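
With the stop timeout of 60s configured above, that arithmetic matches the log exactly: OCF_RESKEY_CRM_meta_timeout is 60000 ms, so

  tries=$(((60000/1000) / 2))   # = 30 roughly one-second graceful-stop attempts

which is the 30-second gap between "Attempting graceful stop" at 17:34:25 and "Killing apache PID" at 17:34:55, before the RA escalates further.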
>
> Do you have logs showing whether, at the time the stop was attempted,
> 3561 was really the PID of a running apache?
> I don't see the RA (at least the version I have on my
> test cluster) checking anywhere for the running binary
> or the like.
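
A check of the kind described here, if one wanted to run it by hand, might look something like this (pid-file path depends on httpd.conf; not actual RA code):

  pid=$(cat /var/run/httpd/httpd.pid)
  readlink "/proc/$pid/exe" | grep -q httpd || echo "PID $pid is not apache"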
>
>
> > Has anyone encountered something like this before? Or am I missing
> > something in the configuration?
> >
> > Thank you,
> > Alex
>


_______________________________________________
Users mailing list: Users at clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


   

