[ClusterLabs] Cannot stop cluster due to order constraint

Ken Gaillot kgaillot at redhat.com
Fri Sep 15 13:30:31 EDT 2017


On Fri, 2017-09-08 at 15:31 +1000, Leon Steffens wrote:
> Hi all,
> 
> We are running Pacemaker 1.1.15 under Centos 6.9, and have a simple
> 3-node cluster with 6 sets of "main" and "backup" resources (just
> Dummy ones):
> 
> main1
> backup1
> main2
> backup2
> etc.
> 
> We have the following co-location constraint between main1 and
> backup1 (-200 because we don't want them to be on the same node, but
> under some circumstances they can end up on the same node):
> 
> pcs constraint colocation add backup1 with main1 -200
> 
> We also have the following order constraint between main1 and
> backup1.  This caters for the scenario where they end up on the same
> node - we want to make sure that "main" gets started before "backup"
> gets stopped, and started somewhere else (because of co-location
> score):
> 
> pcs constraint order start main1 then stop backup1 kind=Serialize

I think you want kind=Optional here. "Optional" means that if both
actions are needed in the same transition, perform them in this order,
otherwise it doesn't limit anything. "Serialize" means the start and
stop can happen in either order, but not simultaneously, and backup1
can't stop unless main1 is starting.
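
For example, the same constraint with only the kind changed (a sketch
using your test resources):

pcs constraint order start main1 then stop backup1 kind=Optional

With Optional, cluster shutdown isn't blocked: no main1 start is
scheduled in that transition, so the backup1 stop can proceed on its
own.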

> When the cluster is started, everything works fine:
> 
> main1   (ocf::heartbeat:Dummy): Started straddie1
> main2   (ocf::heartbeat:Dummy): Started straddie2
> main3   (ocf::heartbeat:Dummy): Started straddie3
> main4   (ocf::heartbeat:Dummy): Started straddie1
> main5   (ocf::heartbeat:Dummy): Started straddie2
> main6   (ocf::heartbeat:Dummy): Started straddie3
> backup1 (ocf::heartbeat:Dummy): Started straddie2
> backup2 (ocf::heartbeat:Dummy): Started straddie1
> backup3 (ocf::heartbeat:Dummy): Started straddie1
> backup4 (ocf::heartbeat:Dummy): Started straddie2
> backup5 (ocf::heartbeat:Dummy): Started straddie1
> backup6 (ocf::heartbeat:Dummy): Started straddie2
> 
> When we do a "pcs cluster stop --all", things do not go so well.  pcs
> cluster stop hangs and the cluster state is as follows:
> 
> main1   (ocf::heartbeat:Dummy): Stopped
> main2   (ocf::heartbeat:Dummy): Stopped
> main3   (ocf::heartbeat:Dummy): Stopped
> main4   (ocf::heartbeat:Dummy): Stopped
> main5   (ocf::heartbeat:Dummy): Stopped
> main6   (ocf::heartbeat:Dummy): Stopped
> backup1 (ocf::heartbeat:Dummy): Started straddie2
> backup2 (ocf::heartbeat:Dummy): Started straddie1
> backup3 (ocf::heartbeat:Dummy): Started straddie1
> backup4 (ocf::heartbeat:Dummy): Started straddie2
> backup5 (ocf::heartbeat:Dummy): Started straddie1
> backup6 (ocf::heartbeat:Dummy): Started straddie2
> 
> The corosync.log clearly shows why this is happening.  It looks like
> Pacemaker wants to stop the backup resources, but the order
> constraint states that the "main" resources should be started first. 
> At this stage the "main" resources have already been stopped, and
> because the cluster is shutting down, the "main" resources cannot be
> started, and we are stuck:
> 
> 
> Sep 08 15:15:07 [23862] straddie3       crmd:     info: match_graph_event:      Action main1_stop_0 (14) confirmed on straddie1 (rc=0)
> Sep 08 15:15:07 [23862] straddie3       crmd:  warning: run_graph:      Transition 48 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-496.bz2): Terminated
> Sep 08 15:15:07 [23862] straddie3       crmd:  warning: te_graph_trigger:       Transition failed: terminated
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_graph:    Graph 48 with 16 actions: batch-limit=0 jobs, network-delay=60000ms
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   14]: Completed rsc op main1_stop_0 on straddie1 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   15]: Completed rsc op main4_stop_0 on straddie1 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   16]: Pending rsc op backup2_stop_0 on straddie1 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:   * [Input 31]: Unresolved dependency rsc op main2_start_0
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   17]: Pending rsc op backup3_stop_0 on straddie1 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:   * [Input 32]: Unresolved dependency rsc op main3_start_0
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   18]: Pending rsc op backup5_stop_0 on straddie1 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:   * [Input 34]: Unresolved dependency rsc op main5_start_0
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   19]: Completed rsc op main2_stop_0 on straddie2 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   20]: Completed rsc op main5_stop_0 on straddie2 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   21]: Pending rsc op backup1_stop_0 on straddie2 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:   * [Input 30]: Unresolved dependency rsc op main1_start_0
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   22]: Pending rsc op backup4_stop_0 on straddie2 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:   * [Input 33]: Unresolved dependency rsc op main4_start_0
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   23]: Pending rsc op backup6_stop_0 on straddie2 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:   * [Input 35]: Unresolved dependency rsc op main6_start_0
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   24]: Completed rsc op main3_stop_0 on straddie3 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   25]: Completed rsc op main6_stop_0 on straddie3 (priority: 0, waiting: none)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   29]: Pending crm op do_shutdown-straddie3 on straddie3 (priority: 0, waiting:  27 28)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   28]: Pending crm op do_shutdown-straddie2 on straddie2 (priority: 0, waiting:  21 22 23)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   27]: Pending crm op do_shutdown-straddie1 on straddie1 (priority: 0, waiting:  16 17 18)
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: print_synapse:  [Action   13]: Pending pseudo op all_stopped on N/A (priority: 0, waiting:  16 17 18 21 22 23)
> Sep 08 15:15:07 [23862] straddie3       crmd:     info: do_log: Input I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
> Sep 08 15:15:07 [23862] straddie3       crmd:   notice: do_state_transition:    State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd
> Sep 08 15:15:07 [23862] straddie3       crmd:     info: do_state_transition:    (Re)Issuing shutdown request now that we are the DC
> Sep 08 15:15:07 [23862] straddie3       crmd:     info: do_shutdown_req:        Sending shutdown request to straddie3
> Sep 08 15:15:07 [23862] straddie3       crmd:     info: handle_shutdown_request:        Creating shutdown request for straddie3 (state=S_IDLE)
> 
> 
> Our current workaround is to delete the constraints before calling
> "pcs cluster stop --all", but we would prefer not to do that.
> 
> If I add "symmetrical=false" it seems to work fine, but we need the
> constraint to work in both directions.  I've tried adding a separate
> order constraint for "start backup then stop main kind=Serialize",
> but I hit the same issue.
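> 
> For reference, the symmetrical=false variant is the same command with
> the extra option:
> 
> pcs constraint order start main1 then stop backup1 kind=Serialize symmetrical=false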
> 
> I've also added another optional order constraint between main and
> backup, saying that backup must be stopped before main is stopped,
> but this didn't seem to work.
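> 
> That constraint was along the lines of:
> 
> pcs constraint order stop backup1 then stop main1 kind=Optional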
> 
> Does anyone have any ideas on how to solve this?
> 
> Thanks,
> Leon
> 
> 
> 
> PS: The full script to create the resources on 3 nodes is:
> 
> 
> echo "Creating main and backup"
> pcs resource create main1 ocf:heartbeat:Dummy
> pcs resource create main2 ocf:heartbeat:Dummy
> pcs resource create main3 ocf:heartbeat:Dummy
> pcs resource create main4 ocf:heartbeat:Dummy
> pcs resource create main5 ocf:heartbeat:Dummy
> pcs resource create main6 ocf:heartbeat:Dummy
> 
> pcs resource create backup1 ocf:heartbeat:Dummy
> pcs resource create backup2 ocf:heartbeat:Dummy
> pcs resource create backup3 ocf:heartbeat:Dummy
> pcs resource create backup4 ocf:heartbeat:Dummy
> pcs resource create backup5 ocf:heartbeat:Dummy
> pcs resource create backup6 ocf:heartbeat:Dummy
> 
> pcs constraint order start main1 then stop backup1 kind=Serialize
> pcs constraint order start main2 then stop backup2 kind=Serialize
> pcs constraint order start main3 then stop backup3 kind=Serialize
> pcs constraint order start main4 then stop backup4 kind=Serialize
> pcs constraint order start main5 then stop backup5 kind=Serialize
> pcs constraint order start main6 then stop backup6 kind=Serialize
> 
> pcs constraint colocation add backup1 with main1 -200
> pcs constraint colocation add backup2 with main2 -200
> pcs constraint colocation add backup3 with main3 -200
> pcs constraint colocation add backup4 with main4 -200
> pcs constraint colocation add backup5 with main5 -200
> pcs constraint colocation add backup6 with main6 -200
> 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>