[ClusterLabs] In N+1 cluster, add/delete of one resource results in resources on other nodes restarting

Ken Gaillot kgaillot at redhat.com
Fri May 19 17:53:52 CEST 2017


On 05/19/2017 04:14 AM, Anu Pillai wrote:
> Hi Ken,
> 
> Did you get any chance to go through the logs? 

Sorry, not yet.

> Do you need any more details ?
> 
> Regards,
> Aswathi
> 
> On Tue, May 16, 2017 at 3:04 PM, Anu Pillai
> <anu.pillai.subscrib at gmail.com> wrote:
> 
>     Hi,
> 
>     Please find attached debug logs for the stated problem as well as
>     crm_mon command outputs. 
>     In this case we are trying to remove/delete res3 and system/node
>     (0005B94238BC) from the cluster.
> 
>     *_Test reproduction steps_*
> 
>     Current Configuration of the cluster:
>      0005B9423910 - res2
>      0005B9427C5A - res1
>      0005B94238BC - res3
> 
>     *crm_mon output:*
> 
>     Defaulting to one-shot mode
>     You need to have curses available at compile time to enable console mode
>     Stack: corosync
>     Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with quorum
>     Last updated: Tue May 16 12:21:23 2017          Last change: Tue May 16 12:13:40 2017 by root via crm_attribute on 0005B9423910
> 
>     3 nodes and 3 resources configured
> 
>     Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
> 
>      res2   (ocf::redundancy:RedundancyRA): Started 0005B9423910
>      res1   (ocf::redundancy:RedundancyRA): Started 0005B9427C5A
>      res3   (ocf::redundancy:RedundancyRA): Started 0005B94238BC
> 
> 
>     Trigger the delete operation for res3 and node 0005B94238BC.
> 
>     The following commands were applied from node 0005B94238BC:
>     $ pcs resource delete res3 --force
>     $ crm_resource -C res3
>     $ pcs cluster stop --force 
> 
>     The following command was applied from the DC (0005B9423910):
>     $ crm_node -R 0005B94238BC --force
> 
> 
>     *crm_mon output:*
>     Defaulting to one-shot mode
>     You need to have curses available at compile time to enable console mode
>     Stack: corosync
>     Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with quorum
>     Last updated: Tue May 16 12:21:27 2017          Last change: Tue May 16 12:21:26 2017 by root via cibadmin on 0005B94238BC
> 
>     3 nodes and 2 resources configured
> 
>     Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
> 
> 
>     The observation is that the remaining two resources, res2 and res1,
>     were stopped and started.
> 
> 
>     Regards,
>     Aswathi
> 
>     On Mon, May 15, 2017 at 8:11 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>         On 05/15/2017 06:59 AM, Klaus Wenninger wrote:
>         > On 05/15/2017 12:25 PM, Anu Pillai wrote:
>         >> Hi Klaus,
>         >>
>         >> Please find attached cib.xml as well as corosync.conf.
> 
>         Maybe you're only setting this while testing, but having
>         stonith-enabled=false and no-quorum-policy=ignore is highly
>         dangerous in any kind of network split.
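> 
>         For example, once a working fence device is in place, something
>         like this restores the safer defaults (a rough sketch; check your
>         pcs version's syntax):
> 
>         $ pcs property set stonith-enabled=true
>         $ pcs property set no-quorum-policy=stop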
> 
>         FYI, default-action-timeout is deprecated in favor of setting a
>         timeout in op_defaults, but it doesn't hurt anything.
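> 
>         Roughly, with pcs (a sketch; the 120s value is only an example):
> 
>         $ pcs resource op defaults timeout=120s
>         $ pcs property unset default-action-timeout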
> 
>         > Why not keep placement-strategy at its default, to keep things
>         > simple? You aren't using any load-balancing anyway, as far as I
>         > understood it.
> 
>         It looks like the intent is to use placement-strategy to limit
>         each node to 1 resource. The configuration looks good for that.
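> 
>         For reference, that kind of setup is usually expressed along these
>         lines, repeated for each node and resource (a sketch; the attribute
>         name "capacity" is arbitrary, and your cib.xml may already use a
>         different one):
> 
>         $ pcs property set placement-strategy=utilization
>         $ pcs node utilization 0005B9423910 capacity=1
>         $ pcs resource utilization res2 capacity=1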
> 
>         > I haven't used resource-stickiness=INF, so I don't know what
>         > strange behavior that might trigger. Try setting it just higher
>         > than what the other scores might sum up to.
> 
>         Either way would be fine. Using INFINITY ensures that no other
>         combination of scores will override it.
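> 
>         For example, as a cluster-wide default (sketch; a per-resource
>         meta attribute works as well):
> 
>         $ pcs resource defaults resource-stickiness=INFINITY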
> 
>         > I might have overlooked something in your scores, but otherwise
>         > there is nothing obvious to me.
>         >
>         > Regards,
>         > Klaus
> 
>         I don't see anything obvious either. If you have logs around
>         the time of the incident, that might help.
> 
>         >> Regards,
>         >> Aswathi
>         >>
>         >> On Mon, May 15, 2017 at 2:46 PM, Klaus Wenninger
>         >> <kwenning at redhat.com> wrote:
>         >>
>         >>     On 05/15/2017 09:36 AM, Anu Pillai wrote:
>         >>     > Hi,
>         >>     >
>         >>     > We are running a Pacemaker cluster to manage our resources. We have
>         >>     > 6 systems running 5 resources, with one system acting as standby. We
>         >>     > have a restriction that only one resource can run on each node. But
>         >>     > our observation is that whenever we add or delete a resource from the
>         >>     > cluster, all the remaining resources in the cluster are stopped and
>         >>     > started again.
>         >>     >
>         >>     > Can you please tell us whether this is normal behavior, or whether we
>         >>     > are missing some configuration that is leading to this issue?
>         >>
>         >>     It should definitely be possible to prevent this behavior.
>         >>     If you share your config with us we might be able to
>         >>     track that down.
>         >>
>         >>     Regards,
>         >>     Klaus
>         >>
>         >>     >
>         >>     > Regards
>         >>     > Aswathi
> 
>         _______________________________________________
>         Users mailing list: Users at clusterlabs.org
>         http://lists.clusterlabs.org/mailman/listinfo/users
> 
>         Project Home: http://www.clusterlabs.org
>         Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>         Bugs: http://bugs.clusterlabs.org
> 
> 
> 



