[ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart
Ken Gaillot
kgaillot at redhat.com
Tue May 23 00:51:34 CEST 2017
On 05/16/2017 04:34 AM, Anu Pillai wrote:
> Hi,
>
> Please find attached debug logs for the stated problem as well as
> crm_mon command outputs.
> In this case we are trying to remove/delete res3 and system/node
> (0005B94238BC) from the cluster.
>
> *_Test reproduction steps_*
>
> Current Configuration of the cluster:
> 0005B9423910 - res2
> 0005B9427C5A - res1
> 0005B94238BC - res3
>
> *crm_mon output:*
>
> Defaulting to one-shot mode
> You need to have curses available at compile time to enable console mode
> Stack: corosync
> Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with quorum
> Last updated: Tue May 16 12:21:23 2017 Last change: Tue May 16
> 12:13:40 2017 by root via crm_attribute on 0005B9423910
>
> 3 nodes and 3 resources configured
>
> Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
>
> res2 (ocf::redundancy:RedundancyRA): Started 0005B9423910
> res1 (ocf::redundancy:RedundancyRA): Started 0005B9427C5A
> res3 (ocf::redundancy:RedundancyRA): Started 0005B94238BC
>
>
> Trigger the delete operation for res3 and node 0005B94238BC.
>
> Following commands applied from node 0005B94238BC
> $ pcs resource delete res3 --force
> $ crm_resource -C res3
> $ pcs cluster stop --force
I don't think "pcs resource delete" or "pcs cluster stop" do anything
with the --force option. In any case, --force shouldn't be needed here.
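For reference, the same steps without --force might look like this (a
sketch assuming the pcs 0.9.x syntax contemporary with pacemaker
1.1.14; run the delete from any node, and the stop on the node being
retired):

```shell
# Remove res3 from the configuration; pcs stops the resource first.
pcs resource delete res3

# Stop cluster services (pacemaker and corosync) on this node.
pcs cluster stop
```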
The crm_mon output you see is actually not what it appears to be. It starts with:
May 16 12:21:27 [4661] 0005B9423910 crmd: notice: do_lrm_invoke:
Forcing the status of all resources to be redetected
This is usually the result of a "cleanup all" command. It works by
erasing the resource history, causing pacemaker to re-probe all nodes to
get the current state. The history erasure makes it appear to crm_mon
that the resources are stopped, but they actually are not.
In this case, I'm not sure why it's doing a "cleanup all", since you
only asked it to clean up res3. Maybe in this particular instance you
actually ran "crm_resource -C" without naming a resource?
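For what it's worth, cleanup can be scoped to a single resource, which
avoids the cluster-wide re-probe (the -r/--resource option is the
standard way to scope crm_resource --cleanup):

```shell
# Erase history and re-probe res3 only.
crm_resource --cleanup --resource res3

# Without --resource, history for ALL resources is erased and every
# node is re-probed, which is what the logs above show:
#   crm_resource --cleanup
```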
> Following command applied from DC(0005B9423910)
> $ crm_node -R 0005B94238BC --force
This can cause problems. This command shouldn't be run unless the node
is removed from both pacemaker's and corosync's configuration. If you
actually are trying to remove the node completely, a better alternative
would be "pcs cluster node remove 0005B94238BC", which will handle all
of that for you. If you're not trying to remove the node completely,
then you shouldn't need this command at all.
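A sketch of the full-removal alternative (run from any node that will
remain in the cluster):

```shell
# Stops cluster services on the departing node and removes it from
# both corosync.conf and the CIB in one step.
pcs cluster node remove 0005B94238BC
```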
>
>
> *crm_mon output:*
> Defaulting to one-shot mode
> You need to have curses available at compile time to enable console mode
> Stack: corosync
> Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with quorum
> Last updated: Tue May 16 12:21:27 2017 Last change: Tue May 16
> 12:21:26 2017 by root via cibadmin on 0005B94238BC
>
> 3 nodes and 2 resources configured
>
> Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
>
>
> Observation is remaining two resources res2 and res1 were stopped and
> started.
>
>
> Regards,
> Aswathi
>
> On Mon, May 15, 2017 at 8:11 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
> On 05/15/2017 06:59 AM, Klaus Wenninger wrote:
> > On 05/15/2017 12:25 PM, Anu Pillai wrote:
> >> Hi Klaus,
> >>
> >> Please find attached cib.xml as well as corosync.conf.
>
> Maybe you're only setting this while testing, but having
> stonith-enabled=false and no-quorum-policy=ignore is highly dangerous in
> any kind of network split.
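For production, something along these lines is the usual starting
point (illustrative pcs commands; an actual stonith device still has
to be configured separately):

```shell
# Re-enable fencing once a stonith device is defined.
pcs property set stonith-enabled=true

# "stop" (the default) is much safer than "ignore" during a split.
pcs property set no-quorum-policy=stop
```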
>
> FYI, default-action-timeout is deprecated in favor of setting a timeout
> in op_defaults, but it doesn't hurt anything.
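The op_defaults equivalent would be something like this (the 120s
value is just an example):

```shell
# Replaces the deprecated default-action-timeout cluster property.
pcs resource op defaults timeout=120s
```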
>
> > Why not keep placement-strategy at its default, to keep things
> > simple? You aren't using any load-balancing anyway, as far as I
> > understood it.
>
> It looks like the intent is to use placement-strategy to limit each node
> to 1 resource. The configuration looks good for that.
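For anyone following along, the one-resource-per-node pattern
generally looks like this (illustrative; repeat for each node and
resource, and note that older pcs versions may lack "pcs node
utilization", in which case crm_attribute --utilization can set the
node attribute instead):

```shell
# Place resources by utilization rather than plain node scores.
pcs property set placement-strategy=utilization

# Each node advertises capacity for exactly one resource...
pcs node utilization 0005B9423910 capacity=1

# ...and each resource consumes one unit of that capacity.
pcs resource utilization res2 capacity=1
```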
>
> > I haven't used resource-stickiness=INF, so I have no idea what
> > strange behavior that triggers. Try setting it just higher than
> > what the other scores might sum up to.
>
> Either way would be fine. Using INFINITY ensures that no other
> combination of scores will override it.
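For example, set as a cluster-wide resource default rather than
per-resource:

```shell
# A running resource is never moved merely to improve placement scores.
pcs resource defaults resource-stickiness=INFINITY
```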
>
> > I might have overlooked something in your scores, but otherwise
> > there is nothing obvious to me.
> >
> > Regards,
> > Klaus
>
> I don't see anything obvious either. If you have logs around the time of
> the incident, that might help.
>
> >> Regards,
> >> Aswathi
> >>
> >> On Mon, May 15, 2017 at 2:46 PM, Klaus Wenninger <kwenning at redhat.com> wrote:
> >>
> >> On 05/15/2017 09:36 AM, Anu Pillai wrote:
> >> > Hi,
> >> >
> >> > We are running a pacemaker cluster to manage our resources. We
> >> > have 6 systems running 5 resources, and one is acting as standby.
> >> > We have a restriction that only one resource can run on a node.
> >> > But we observe that whenever we add or delete a resource from the
> >> > cluster, all the remaining resources are stopped and started again.
> >> >
> >> > Can you please tell us whether this is normal behavior, or whether
> >> > we are missing some configuration that is leading to this issue?
> >>
> >> It should definitely be possible to prevent this behavior.
> >> If you share your config with us we might be able to
> >> track that down.
> >>
> >> Regards,
> >> Klaus
> >>
> >> >
> >> > Regards
> >> > Aswathi