[ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart

Anu Pillai anu.pillai.subscrib at gmail.com
Wed May 24 05:58:26 UTC 2017


Blank response so the thread appears in my mailbox; please ignore.

On Tue, May 23, 2017 at 4:21 AM, Ken Gaillot <kgaillot at redhat.com> wrote:

> On 05/16/2017 04:34 AM, Anu Pillai wrote:
> > Hi,
> >
> > Please find attached debug logs for the stated problem as well as
> > crm_mon command outputs.
> > In this case, we are trying to remove/delete res3 and its node
> > (0005B94238BC) from the cluster.
> >
> > *_Test reproduction steps_*
> >
> > Current Configuration of the cluster:
> >  0005B9423910 - res2
> >  0005B9427C5A - res1
> >  0005B94238BC - res3
> >
> > *crm_mon output:*
> >
> > Defaulting to one-shot mode
> > You need to have curses available at compile time to enable console mode
> > Stack: corosync
> > Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with quorum
> > Last updated: Tue May 16 12:21:23 2017          Last change: Tue May 16
> > 12:13:40 2017 by root via crm_attribute on 0005B9423910
> >
> > 3 nodes and 3 resources configured
> >
> > Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
> >
> >  res2   (ocf::redundancy:RedundancyRA): Started 0005B9423910
> >  res1   (ocf::redundancy:RedundancyRA): Started 0005B9427C5A
> >  res3   (ocf::redundancy:RedundancyRA): Started 0005B94238BC
> >
> >
> > Trigger the delete operation for res3 and node 0005B94238BC.
> >
> > The following commands were applied from node 0005B94238BC:
> > $ pcs resource delete res3 --force
> > $ crm_resource -C res3
> > $ pcs cluster stop --force
>
> I don't think "pcs resource delete" or "pcs cluster stop" do anything
> with the --force option. In any case, --force shouldn't be needed here.
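>
> For what it's worth, a minimal sketch of the same steps without --force
> (assuming the pcs syntax on your version behaves like 0.9.x):
>
> $ pcs resource delete res3    # remove res3 from the cluster configuration
> $ pcs cluster stop            # stop pacemaker/corosync on this node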
>
> The crm_mon output you see is actually not what it appears. It starts with:
>
> May 16 12:21:27 [4661] 0005B9423910       crmd:   notice: do_lrm_invoke:
>        Forcing the status of all resources to be redetected
>
> This is usually the result of a "cleanup all" command. It works by
> erasing the resource history, causing pacemaker to re-probe all nodes to
> get the current state. The history erasure makes it appear to crm_mon
> that the resources are stopped, but they actually are not.
>
> In this case, I'm not sure why it's doing a "cleanup all", since you
> only asked it to clean up res3. Maybe in this particular instance, you
> actually ran "crm_resource -C" with no resource specified?
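>
> For reference, a sketch of the two forms (long option names as in
> pacemaker 1.1.x; check crm_resource --help on your build):
>
> $ crm_resource --cleanup --resource res3   # re-probe only res3
> $ crm_resource --cleanup                   # no resource given: re-probe everything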
>
> > The following command was applied from the DC (0005B9423910):
> > $ crm_node -R 0005B94238BC --force
>
> This can cause problems. This command shouldn't be run unless the node
> is removed from both pacemaker's and corosync's configuration. If you
> actually are trying to remove the node completely, a better alternative
> would be "pcs cluster node remove 0005B94238BC", which will handle all
> of that for you. If you're not trying to remove the node completely,
> then you shouldn't need this command at all.
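>
> For example (assuming pcs is managing corosync.conf on all of your
> nodes), a single command handles both layers:
>
> $ pcs cluster node remove 0005B94238BC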
>
> >
> >
> > *crm_mon output:*
> >
> > Defaulting to one-shot mode
> > You need to have curses available at compile time to enable console mode
> > Stack: corosync
> > Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with quorum
> > Last updated: Tue May 16 12:21:27 2017          Last change: Tue May 16
> > 12:21:26 2017 by root via cibadmin on 0005B94238BC
> >
> > 3 nodes and 2 resources configured
> >
> > Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
> >
> >
> > Our observation is that the remaining two resources, res2 and res1, were
> > stopped and started again.
> >
> >
> > Regards,
> > Aswathi
> >
> > On Mon, May 15, 2017 at 8:11 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> >
> >     On 05/15/2017 06:59 AM, Klaus Wenninger wrote:
> >     > On 05/15/2017 12:25 PM, Anu Pillai wrote:
> >     >> Hi Klaus,
> >     >>
> >     >> Please find attached cib.xml as well as corosync.conf.
> >
> >     Maybe you're only setting this while testing, but having
> >     stonith-enabled=false and no-quorum-policy=ignore is highly
> >     dangerous in any kind of network split.
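> >
> >     As a sketch (property names only; the fence agent itself depends on
> >     your hardware), the safer settings once a fence device is configured
> >     would be something like:
> >
> >     $ pcs property set stonith-enabled=true
> >     $ pcs property set no-quorum-policy=stop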
> >
> >     FYI, default-action-timeout is deprecated in favor of setting a
> >     timeout in op_defaults, but it doesn't hurt anything.
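> >
> >     For example, the equivalent op_defaults setting (pcs syntax; the
> >     value here is just illustrative):
> >
> >     $ pcs resource op defaults timeout=120s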
> >
> >     > Why wouldn't you keep placement-strategy at the default
> >     > to keep things simple? You aren't using any load-balancing
> >     > anyway, as far as I understood it.
> >
> >     It looks like the intent is to use placement-strategy to limit each
> >     node to 1 resource. The configuration looks good for that.
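> >
> >     For anyone following along, the usual shape of that setup looks
> >     something like this (pcs utilization subcommands, if your pcs
> >     version has them; the attribute name "capacity" is arbitrary):
> >
> >     $ pcs property set placement-strategy=utilization
> >     $ pcs node utilization 0005B9423910 capacity=1
> >     $ pcs resource utilization res2 capacity=1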
> >
> >     > Haven't used resource-stickiness=INF. No idea what strange
> >     > behavior that triggers. Try to have it just higher than what
> >     > the other scores might sum up to.
> >
> >     Either way would be fine. Using INFINITY ensures that no other
> >     combination of scores will override it.
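> >
> >     Either value can be set cluster-wide via resource defaults, e.g.
> >     (pcs syntax; pick the number to suit your scores):
> >
> >     $ pcs resource defaults resource-stickiness=INFINITY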
> >
> >     > I might have overlooked something in your scores, but otherwise
> >     > there is nothing obvious to me.
> >     >
> >     > Regards,
> >     > Klaus
> >
> >     I don't see anything obvious either. If you have logs around the
> >     time of the incident, that might help.
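> >
> >     crm_report, if installed, can bundle the logs and CIB around the
> >     incident window, e.g. (timestamps are just placeholders):
> >
> >     $ crm_report -f "2017-05-16 12:00:00" -t "2017-05-16 12:30:00"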
> >
> >     >> Regards,
> >     >> Aswathi
> >     >>
> >     >> On Mon, May 15, 2017 at 2:46 PM, Klaus Wenninger
> >     >> <kwenning at redhat.com> wrote:
> >     >>
> >     >>     On 05/15/2017 09:36 AM, Anu Pillai wrote:
> >     >>     > Hi,
> >     >>     >
> >     >>     > We are running a pacemaker cluster to manage our resources.
> >     >>     > We have 6 systems running 5 resources, with one acting as
> >     >>     > standby. We have a restriction that only one resource can
> >     >>     > run on one node. But our observation is that whenever we add
> >     >>     > or delete a resource from the cluster, all the remaining
> >     >>     > resources in the cluster are stopped and started again.
> >     >>     >
> >     >>     > Can you please guide us on whether this is normal behavior
> >     >>     > or whether we are missing any configuration that is leading
> >     >>     > to this issue.
> >     >>
> >     >>     It should definitely be possible to prevent this behavior.
> >     >>     If you share your config with us, we might be able to track
> >     >>     that down.
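> >     >>
> >     >>     A plain-text dump is easiest to read on the list, e.g.:
> >     >>
> >     >>     $ pcs config
> >     >>     $ cibadmin --query > cib.xml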
> >     >>
> >     >>     Regards,
> >     >>     Klaus
> >     >>
> >     >>     >
> >     >>     > Regards
> >     >>     > Aswathi
>