[Pacemaker] pacemaker shutdown waits for a failover

Andrew Beekhof andrew@beekhof.net
Mon Jul 28 19:12:12 EDT 2014


On 28 Jul 2014, at 5:07 pm, Liron Amitzi <LironA@imperva.com> wrote:

> When I run "service pacemaker stop" it takes a long time: I see that it stops all the resources, then starts them on the other node, and only then does the "stop" command complete.

Ahhh! The node you were stopping was the DC (Designated Controller).

It appears to be deliberate; I found the commit from 2008 where the behaviour was introduced:
   https://github.com/beekhof/pacemaker/commit/7bf55f0

I could change it, but I'm no longer sure that would be a good idea, as it would increase service downtime.
(Electing and bootstrapping a new DC introduces additional delays before the cluster can bring up any resources).
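
One way around the wait, if it bothers you, is to drain the node before
stopping Pacemaker, so the failover happens while the node is still an
ordinary cluster member. Untested against your setup, but something like
this should work (crm shell syntax; node names taken from your logs):

   # Check which node is currently the DC
   crmadmin -D

   # Push the resources off ha1, then watch until ip_resource,
   # OracleDB and JavaSrv all report Started on ha2
   crm node standby ha1
   crm_mon -1

   # With nothing left to fail over, the stop should return quickly
   service pacemaker stop

   # Afterwards (from ha2, or once ha1 rejoins), clear the standby flag
   crm node online ha1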

I assume there is a particular resource that takes a long time to start?
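
Comparing the timestamps of each "Initiating action ... start" line with
the matching "process_lrm_event ... ok" line should show where the time
goes. Something along these lines (log path from your mail):

   grep -E 'Initiating action|process_lrm_event' /var/log/cluster/corosync.log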


> I have 3 resources, IP, OracleDB and JavaSrv
> 
> This is the output on the screen:
> [root@ha1 ~]# service pacemaker stop
> Signaling Pacemaker Cluster Manager to terminate:          [  OK  ]
> Waiting for cluster services to unload:............................................................                                              [  OK  ]
> [root@ha1 ~]#
> 
> And these are parts of the log (/var/log/cluster/corosync.log):
> Jun 29 15:14:15 [28031] ha1    pengine:   notice: stage6:  Scheduling Node ha1 for shutdown
> Jun 29 15:14:15 [28031] ha1    pengine:   notice: LogActions:      Move    ip_resource     (Started ha1 -> ha2)
> Jun 29 15:14:15 [28031] ha1    pengine:   notice: LogActions:      Move    OracleDB        (Started ha1 -> ha2)
> Jun 29 15:14:15 [28031] ha1    pengine:   notice: LogActions:      Move    JavaSrv    (Started ha1 -> ha2)
> Jun 29 15:14:15 [28032] ha1       crmd:     info: te_rsc_command:  Initiating action 12: stop JavaSrv_stop_0 on ha1 (local)
> Jun 29 15:14:15 ha1 lrmd: [28029]: info: rsc:JavaSrv:16: stop
> ...
> Jun 29 15:14:41 [28032] ha1       crmd:     info: process_lrm_event:       LRM operation JavaSrv_stop_0 (call=16, rc=0, cib-update=447, confirmed=true) ok
> Jun 29 15:14:41 [28032] ha1       crmd:     info: te_rsc_command:  Initiating action 9: stop OracleDB_stop_0 on ha1 (local)
> Jun 29 15:14:41 ha1 lrmd: [28029]: info: cancel_op: operation monitor[13] on lsb::ha-dbora::OracleDB for client 28032, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.6] CRM_meta_timeout=[600000] CRM_meta_interval=[60000]  cancelled
> Jun 29 15:14:41 ha1 lrmd: [28029]: info: rsc:OracleDB:17: stop
> ...
> Jun 29 15:15:08 [28032] ha1       crmd:     info: process_lrm_event:       LRM operation OracleDB_stop_0 (call=17, rc=0, cib-update=448, confirmed=true) ok
> Jun 29 15:15:08 [28032] ha1       crmd:     info: te_rsc_command:  Initiating action 7: stop ip_resource_stop_0 on ha1 (local)
> ...
> Jun 29 15:15:08 [28032] ha1       crmd:     info: process_lrm_event:       LRM operation ip_resource_stop_0 (call=18, rc=0, cib-update=449, confirmed=true) ok
> Jun 29 15:15:08 [28032] ha1       crmd:     info: te_rsc_command:  Initiating action 8: start ip_resource_start_0 on ha2
> Jun 29 15:15:08 [28032] ha1       crmd:     info: te_crm_command:  Executing crm-event (21): do_shutdown on ha1
> Jun 29 15:15:08 [28032] ha1       crmd:     info: te_crm_command:  crm-event (21) is a local shutdown
> Jun 29 15:15:09 [28032] ha1       crmd:     info: te_rsc_command:  Initiating action 10: start OracleDB_start_0 on ha2
> Jun 29 15:15:51 [28032] ha1       crmd:     info: te_rsc_command:  Initiating action 11: monitor OracleDB_monitor_60000 on ha2
> Jun 29 15:15:51 [28032] ha1       crmd:     info: te_rsc_command:  Initiating action 13: start JavaSrv_start_0 on ha2
> ...
> Jun 29 15:27:09 [28023] ha1 pacemakerd:     info: pcmk_child_exit:         Child process cib exited (pid=28027, rc=0)
> Jun 29 15:27:09 [28023] ha1 pacemakerd:   notice: pcmk_shutdown_worker:    Shutdown complete
> Jun 29 15:27:09 [28023] ha1 pacemakerd:     info: main:    Exiting pacemakerd
> 
> 
> 
> ________________________________________
> From: Andrew Beekhof <andrew@beekhof.net>
> Sent: Monday, July 28, 2014 2:08
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] pacemaker shutdown waits for a failover
> 
> On 28 Jul 2014, at 12:40 am, Liron Amitzi <LironA@imperva.com> wrote:
> 
>> Hi guys,
>> I'm working with Pacemaker 1.1.7-6 and Corosync 1.4.1-15 (2 nodes) and seeing some strange behavior.
>> I have several resources, including an Oracle database, and when I stop Pacemaker or reboot the active node it takes a very long time. I looked into it, and it seems that Pacemaker waits until the failover is complete before stopping. I expected it to stop the resources, initiate the failover, and exit, not wait until everything is up on the other node.
> 
> That's what I would expect too.
> Can you show us something that would suggest this isn't happening?
> 
>> Am I missing something? Is this expected?
>> Thanks,
>> Liron
> 
> 
