[Pacemaker] Time to a service stop is very long.

Wed Oct 27 06:36:59 EDT 2010

On Thu, Oct 21, 2010 at 10:30 AM,  <renayama19661014 at ybb.ne.jp> wrote:
> Hi,
>
> We tested the behavior with no-quorum-policy set to freeze.
> In a cluster where the freeze setting had taken effect, we stopped the service.
>
> However, stopping the service took a very long time.
>
> We set "shutdown-escalation" to five minutes to shorten the test.
> But stopping the service on one node takes more than five minutes.
>
> I confirmed it with the following procedure.
>
> Step1) Start four nodes and load cib.xml.
> Step2) Block Heartbeat communication so that the cluster is split into two-node partitions.
> Step3) The nodes enter the freeze state.
> Step4) On the two nodes of one partition, we stop Heartbeat at the same time.
>
> [root@srv03 ~]# service heartbeat stop
> Stopping High-Availability services:
> [root@srv04 ~]# service heartbeat stop
> Stopping High-Availability services:
>
> Step5) Heartbeat on one node stops after a few minutes.
> [root@srv04 ~]# service heartbeat stop
> Stopping High-Availability services:                       [  OK  ]
>
> Step6) But Heartbeat on the other node does not stop until considerably more time has passed.
>  * The shutdown-escalation timer starts, but the value we set (5min) does not seem to take effect.
>
> [root@srv03 ~]# service heartbeat stop
> Stopping High-Availability services:                       [  OK  ]
>
> Oct 21 16:46:57 srv03 crmd: [4432]: info: do_shutdown_req: Sending shutdown request to DC: srv03
> Oct 21 16:46:57 srv03 crmd: [4432]: info: handle_shutdown_request: Creating shutdown request for srv03 (state=S_IDLE)
> Oct 21 16:53:07 srv03 cib: [4428]: info: cib_stats: Processed 805 operations (38149.00us average, 5% utilization) in the last 10min
> Oct 21 16:57:20 srv03 crmd: [4432]: ERROR: crm_timer_popped: Shutdown Escalation (I_STOP) just popped!
> Oct 21 16:57:20 srv03 crmd: [4432]: ERROR: do_log: FSA: Input I_STOP from crm_timer_popped() received in state S_IDLE
> Oct 21 16:57:20 srv03 crmd: [4432]: info: do_state_transition: State transition S_IDLE -> S_STOPPING [ input=I_STOP cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Oct 21 16:57:20 srv03 crmd: [4432]: info: do_dc_release: DC role released
> Oct 21 16:57:20 srv03 crmd: [4432]: info: stop_subsystem: Sent -TERM to pengine: [5007]
>
>
> Is it correct behavior for this service stop to take so long?

It's what I would expect to happen, but it's possibly not ideal.
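
For reference, the gap between the shutdown request and the escalation timer
in the quoted srv03 log can be worked out from the two timestamps. The sketch
below is only that arithmetic. It assumes the timer is armed when the shutdown
request is sent, that the year is 2010 (syslog lines omit it), and that the
month abbreviation parses under an English or C locale.

```python
from datetime import datetime

# Timestamps copied from the quoted srv03 log lines.
# The year is an assumption: syslog omits it, so the report date is used.
FMT = "%Y %b %d %H:%M:%S"
shutdown_requested = datetime.strptime("2010 Oct 21 16:46:57", FMT)
escalation_popped = datetime.strptime("2010 Oct 21 16:57:20", FMT)

# Time between "Sending shutdown request to DC" and
# "Shutdown Escalation (I_STOP) just popped!".
elapsed = escalation_popped - shutdown_requested

# shutdown-escalation as configured for the test, in seconds.
configured_seconds = 5 * 60

print(f"elapsed: {elapsed}")
print(f"over configured value by: {elapsed.total_seconds() - configured_seconds:.0f}s")
```

That is a little over ten minutes, against the five that were configured.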

>  * Because the log is very large, I did not attach it.
>  * If the log is needed, I will send it via Bugzilla.
>
> Best Regards,
> Hideo Yamauchi.