[Pacemaker] Time to a service stop is very long.

renayama19661014 at ybb.ne.jp
Thu Oct 21 04:30:55 EDT 2010


We tested the cluster behavior with no-quorum-policy set to freeze.
In a cluster where the freeze setting had taken effect, we stopped the service.

However, stopping the service took a very long time.

We set "shutdown-escalation" to five minutes to shorten the time for the test.
Even so, stopping the service on one node took more than five minutes.

I confirmed this with the following procedure.

Step1) Start four nodes and send cib.xml.
Step2) Block Heartbeat communication to split the cluster into two partitions of two nodes each.
Step3) The nodes freeze.
Step4) On the two nodes of one partition, we stop Heartbeat at the same time.

[root@srv03 ~]# service heartbeat stop
Stopping High-Availability services:
[root@srv04 ~]# service heartbeat stop
Stopping High-Availability services:

Step5) Heartbeat on one node stops within a few minutes.
[root@srv04 ~]# service heartbeat stop
Stopping High-Availability services:                       [  OK  ]

Step6) However, Heartbeat on the other node does not stop until considerably more time has passed.
 * The shutdown-escalation timer starts, but the time we set (5 min) does not seem to take effect.

[root@srv03 ~]# service heartbeat stop
Stopping High-Availability services:                       [  OK  ]

Oct 21 16:46:57 srv03 crmd: [4432]: info: do_shutdown_req: Sending shutdown request to DC: srv03
Oct 21 16:46:57 srv03 crmd: [4432]: info: handle_shutdown_request: Creating shutdown request for srv03
Oct 21 16:53:07 srv03 cib: [4428]: info: cib_stats: Processed 805 operations (38149.00us average, 5% utilization) in the last 10min
Oct 21 16:57:20 srv03 crmd: [4432]: ERROR: crm_timer_popped: Shutdown Escalation (I_STOP) just popped!
Oct 21 16:57:20 srv03 crmd: [4432]: ERROR: do_log: FSA: Input I_STOP from crm_timer_popped() received in state S_IDLE
Oct 21 16:57:20 srv03 crmd: [4432]: info: do_state_transition: State transition S_IDLE -> S_STOPPING [ input=I_STOP cause=C_TIMER_POPPED origin=crm_timer_popped ]
Oct 21 16:57:20 srv03 crmd: [4432]: info: do_dc_release: DC role released
Oct 21 16:57:20 srv03 crmd: [4432]: info: stop_subsystem: Sent -TERM to pengine: [5007]
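
As a rough check of the timing in the srv03 log above, the interval between the shutdown request (16:46:57) and the escalation timer popping (16:57:20) can be compared with the configured five minutes. This is only a sketch: the year 2010 is assumed from the posting date, and the variable names are illustrative, not taken from Pacemaker.

```python
from datetime import datetime

# Timestamps copied from the srv03 log excerpt; the year is an assumption
# (syslog lines carry no year, 2010 is taken from the posting date).
shutdown_requested = datetime(2010, 10, 21, 16, 46, 57)  # do_shutdown_req
escalation_popped = datetime(2010, 10, 21, 16, 57, 20)   # crm_timer_popped

# shutdown-escalation as configured for the test: five minutes, in seconds.
configured_s = 5 * 60

elapsed_s = int((escalation_popped - shutdown_requested).total_seconds())
overrun_s = elapsed_s - configured_s

print(f"elapsed: {elapsed_s // 60}m{elapsed_s % 60:02d}s")
print(f"over configured escalation by: {overrun_s}s")
```

By this calculation the escalation fired about 10 minutes 23 seconds after the shutdown request, roughly double the configured value.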

Is it correct behavior for the service stop to take this long?

 * Because the log is very large, I have not attached it.
 * If the log is needed, I will send it via Bugzilla.

Best Regards,
Hideo Yamauchi.
