[ClusterLabs] service flap as nodes join and leave

Christopher Harvey cwh at eml.cc
Wed Apr 13 20:28:33 UTC 2016



On Wed, Apr 13, 2016, at 12:36 PM, Ken Gaillot wrote:
> On 04/13/2016 11:23 AM, Christopher Harvey wrote:
> > I have a 3 node cluster (see the bottom of this email for 'pcs config'
> > output). The MsgBB-Active and AD-Active services both flap
> > whenever a node joins or leaves the cluster. I trigger the leave and
> > join by stopping and starting the pacemaker service on any node.
> 
> That's the default behavior of clones used in ordering constraints. If
> you set interleave=true on your clones, each dependent clone instance
> will only care about the depended-on instances on its own node, rather
> than all nodes.
> 
> See
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_clone_options
> 
> While the interleave=true behavior is much more commonly used,
> interleave=false is the default because it's safer -- the cluster
> doesn't know anything about the cloned service, so it can't assume the
> service is OK with interleaving. Since you know what your service does,
> you can set interleave=true for services that can handle it.

Hi Ken,

Thanks for pointing out that attribute to me. I applied it as follows:
 Clone: Router-clone
  Meta Attrs: clone-max=2 clone-node-max=1 interleave=true
  Resource: Router (class=ocf provider=solace type=Router)
   Meta Attrs: migration-threshold=1 failure-timeout=1s
   Operations: start interval=0s timeout=2 (Router-start-interval-0s)
               stop interval=0s timeout=2 (Router-stop-interval-0s)
               monitor interval=1s (Router-monitor-interval-1s)
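
For anyone following along, the clone meta attribute can be set with
something like the following pcs command (exact syntax may vary between
pcs versions):

  # add interleave=true to the clone's existing meta attributes
  pcs resource meta Router-clone interleave=true

and 'pcs config' should then show the attribute on the clone, as above.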

It doesn't seem to change the behavior. Moreover, I found that I can
start/stop the pacemaker instance on the vmr-132-5 node and produce the
same flap on the MsgBB-Active resource on the vmr-132-3 node. The Router
clones are never shut down or started. I would have thought that if
everything else in the cluster is constant, vmr-132-5 could never affect
resources on the other two nodes.
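
If it would help with debugging, I can capture the allocation scores
from the live cluster during one of these transitions with something
like the following (assuming the stock pacemaker CLI tools; flags may
differ by version):

  # show allocation scores against the live CIB
  crm_simulate -sL

and post that output along with the pe-input files from the DC.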

> > Here is the happy steady state setup:
> > 
> > 3 nodes and 4 resources configured
> > 
> > Online: [ vmr-132-3 vmr-132-4 vmr-132-5 ]
> > 
> >  Clone Set: Router-clone [Router]
> >      Started: [ vmr-132-3 vmr-132-4 ]
> > MsgBB-Active    (ocf::solace:MsgBB-Active):     Started vmr-132-3
> > AD-Active       (ocf::solace:AD-Active):        Started vmr-132-3
> > 
> > [root at vmr-132-4 ~]# supervisorctl stop pacemaker
> > no change, except vmr-132-4 goes offline
> > [root at vmr-132-4 ~]# supervisorctl start pacemaker
> > vmr-132-4 comes back online
> > MsgBB-Active and AD-Active flap very quickly (<1s)
> > Steady state is resumed.
> > 
> > Why should vmr-132-4 coming and going affect the service on any other
> > node?
> > 
> > Thanks,
> > Chris
> > 
> > Cluster Name:
> > Corosync Nodes:
> >  192.168.132.5 192.168.132.4 192.168.132.3
> > Pacemaker Nodes:
> >  vmr-132-3 vmr-132-4 vmr-132-5
> > 
> > Resources:
> >  Clone: Router-clone
> >   Meta Attrs: clone-max=2 clone-node-max=1
> >   Resource: Router (class=ocf provider=solace type=Router)
> >    Meta Attrs: migration-threshold=1 failure-timeout=1s
> >    Operations: start interval=0s timeout=2 (Router-start-timeout-2)
> >                stop interval=0s timeout=2 (Router-stop-timeout-2)
> >                monitor interval=1s (Router-monitor-interval-1s)
> >  Resource: MsgBB-Active (class=ocf provider=solace type=MsgBB-Active)
> >   Meta Attrs: migration-threshold=2 failure-timeout=1s
> >   Operations: start interval=0s timeout=2 (MsgBB-Active-start-timeout-2)
> >               stop interval=0s timeout=2 (MsgBB-Active-stop-timeout-2)
> >               monitor interval=1s (MsgBB-Active-monitor-interval-1s)
> >  Resource: AD-Active (class=ocf provider=solace type=AD-Active)
> >   Meta Attrs: migration-threshold=2 failure-timeout=1s
> >   Operations: start interval=0s timeout=2 (AD-Active-start-timeout-2)
> >               stop interval=0s timeout=2 (AD-Active-stop-timeout-2)
> >               monitor interval=1s (AD-Active-monitor-interval-1s)
> > 
> > Stonith Devices:
> > Fencing Levels:
> > 
> > Location Constraints:
> >   Resource: AD-Active
> >     Disabled on: vmr-132-5 (score:-INFINITY) (id:ADNotOnMonitor)
> >   Resource: MsgBB-Active
> >     Enabled on: vmr-132-4 (score:100) (id:vmr-132-4Priority)
> >     Enabled on: vmr-132-3 (score:250) (id:vmr-132-3Priority)
> >     Disabled on: vmr-132-5 (score:-INFINITY) (id:MsgBBNotOnMonitor)
> >   Resource: Router-clone
> >     Disabled on: vmr-132-5 (score:-INFINITY) (id:RouterNotOnMonitor)
> > Ordering Constraints:
> >   Resource Sets:
> >     set Router-clone MsgBB-Active sequential=true
> >     (id:pcs_rsc_set_Router-clone_MsgBB-Active) setoptions kind=Mandatory
> >     (id:pcs_rsc_order_Router-clone_MsgBB-Active)
> >     set MsgBB-Active AD-Active sequential=true
> >     (id:pcs_rsc_set_MsgBB-Active_AD-Active) setoptions kind=Mandatory
> >     (id:pcs_rsc_order_MsgBB-Active_AD-Active)
> > Colocation Constraints:
> >   MsgBB-Active with Router-clone (score:INFINITY)
> >   (id:colocation-MsgBB-Active-Router-clone-INFINITY)
> >   AD-Active with MsgBB-Active (score:1000)
> >   (id:colocation-AD-Active-MsgBB-Active-1000)
> > 
> > Resources Defaults:
> >  No defaults set
> > Operations Defaults:
> >  No defaults set
> > 
> > Cluster Properties:
> >  cluster-infrastructure: corosync
> >  cluster-recheck-interval: 1s
> >  dc-version: 1.1.13-10.el7_2.2-44eb2dd
> >  have-watchdog: false
> >  maintenance-mode: false
> >  start-failure-is-fatal: false
> >  stonith-enabled: false



