[ClusterLabs] In N+1 cluster, add/delete of one resource result in other node resources to restart
Nikhil Utane
nikhil.subscribed at gmail.com
Wed May 24 08:50:01 CEST 2017
Thanks Aswathi.
(My account had stopped working due to mail bounces; I have never seen that
happen with Gmail accounts.)
Ken,
Answers to your questions are below:
*1. Using force option*
A) During our testing we had observed that in some instances the resource
deletion would fail and that's why we added the force option. With the
force option we never saw the problem again.
*2. "Maybe in this particular instance, you actually did "crm_resource
-C"?"*
A) This step is done through code, so there is no human involvement. We
print the full command, and we always see that the resource name is included.
So this cannot happen.
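To illustrate the kind of check meant here, a minimal sketch follows. It is not our actual code: build_cleanup_cmd is a hypothetical helper, and it assumes the resource name is passed with crm_resource's -r (--resource) option rather than as a bare argument as in the command quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not the real code): build the cleanup command and
# refuse to produce one that has no resource name.
build_cleanup_cmd() {
    res="$1"
    if [ -z "$res" ]; then
        echo "refusing cleanup without a resource name" >&2
        return 1
    fi
    # Assumption: -r/--resource restricts the cleanup to this one resource.
    printf 'crm_resource -C -r %s\n' "$res"
}
```

Calling build_cleanup_cmd res3 prints the command that would be logged and executed; an empty name is rejected instead of falling back to a cleanup of everything.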
*3. $ crm_node -R 0005B94238BC --force*
A) Yes, we want to remove the node completely. We are not specifying the
node information in corosync.conf so there is nothing to be removed there.
I need to go back and check, but I vaguely remember that because of some
issue we switched from the "pcs cluster node remove" command to the
crm_node -R command, perhaps because it gave us the option to use force.
*4. "No STONITH and QUORUM"*
A) As I have mentioned earlier, split-brain doesn't pose a problem for us
since we have a second line of defense based on our architecture. Hence we
have made a conscious decision to disable both. The config IS for production.
BTW, we also issue a "pcs resource disable" command before doing a "pcs
resource delete". Not sure if that makes any difference.
We will play around with those 4-5 commands that we execute to see whether
the resource restart happens as a reaction to any of those commands.
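A rough sketch of how that step-by-step check could be scripted is below. The function names, DRY_RUN and SETTLE_SECS are made up for illustration, the wait time is arbitrary, and the -r form of the cleanup command is an assumption (the thread quotes it as "crm_resource -C res3"). The other commands are the ones mentioned in this thread.

```shell
#!/bin/sh
# Sketch only: run the removal steps one at a time and print the cluster
# status after each, to see which step makes res1/res2 restart.
# DRY_RUN defaults to 1, which only prints the commands.
DRY_RUN="${DRY_RUN:-1}"
SETTLE_SECS="${SETTLE_SECS:-5}"   # arbitrary wait before checking status

run_step() {
    echo "== step: $*"
    if [ "$DRY_RUN" = "1" ]; then
        return 0
    fi
    "$@"
    rc=$?
    sleep "$SETTLE_SECS"
    crm_mon -1                    # one-shot status after this step
    return "$rc"
}

# Steps run on the node being removed.
steps_on_leaving_node() {
    res="$1"
    run_step pcs resource disable "$res"
    run_step pcs resource delete "$res" --force
    run_step crm_resource -C -r "$res"   # -r form is an assumption
    run_step pcs cluster stop --force
}

# Step run on the DC afterwards.
step_on_dc() {
    node="$1"
    run_step crm_node -R "$node" --force
}
```

With DRY_RUN=0, steps_on_leaving_node res3 would be run on 0005B94238BC and step_on_dc 0005B94238BC on the DC, comparing the crm_mon output between steps.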
-Thanks & Regards
Nikhil
On Wed, May 24, 2017 at 11:28 AM, Anu Pillai <anu.pillai.subscrib at gmail.com>
wrote:
> blank response for thread to appear in mailbox..pls ignore
>
> On Tue, May 23, 2017 at 4:21 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
>> On 05/16/2017 04:34 AM, Anu Pillai wrote:
>> > Hi,
>> >
>> > Please find attached debug logs for the stated problem as well as
>> > crm_mon command outputs.
>> > In this case we are trying to remove/delete res3 and system/node
>> > (0005B94238BC) from the cluster.
>> >
>> > *_Test reproduction steps_*
>> >
>> > Current Configuration of the cluster:
>> > 0005B9423910 - res2
>> > 0005B9427C5A - res1
>> > 0005B94238BC - res3
>> >
>> > *crm_mon output:*
>> >
>> > Defaulting to one-shot mode
>> > You need to have curses available at compile time to enable console mode
>> > Stack: corosync
>> > Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with quorum
>> > Last updated: Tue May 16 12:21:23 2017 Last change: Tue May 16
>> > 12:13:40 2017 by root via crm_attribute on 0005B9423910
>> >
>> > 3 nodes and 3 resources configured
>> >
>> > Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
>> >
>> > res2 (ocf::redundancy:RedundancyRA): Started 0005B9423910
>> > res1 (ocf::redundancy:RedundancyRA): Started 0005B9427C5A
>> > res3 (ocf::redundancy:RedundancyRA): Started 0005B94238BC
>> >
>> >
>> > Trigger the delete operation for res3 and node 0005B94238BC.
>> >
>> > Following commands applied from node 0005B94238BC
>> > $ pcs resource delete res3 --force
>> > $ crm_resource -C res3
>> > $ pcs cluster stop --force
>>
>> I don't think "pcs resource delete" or "pcs cluster stop" does anything
>> with the --force option. In any case, --force shouldn't be needed here.
>>
>> The crm_mon output you see is actually not what it appears. It starts
>> with:
>>
>> May 16 12:21:27 [4661] 0005B9423910 crmd: notice: do_lrm_invoke:
>> Forcing the status of all resources to be redetected
>>
>> This is usually the result of a "cleanup all" command. It works by
>> erasing the resource history, causing pacemaker to re-probe all nodes to
>> get the current state. The history erasure makes it appear to crm_mon
>> that the resources are stopped, but they actually are not.
>>
>> In this case, I'm not sure why it's doing a "cleanup all", since you
>> only asked it to cleanup res3. Maybe in this particular instance, you
>> actually did "crm_resource -C"?
>>
>> > Following command applied from DC(0005B9423910)
>> > $ crm_node -R 0005B94238BC --force
>>
>> This can cause problems. This command shouldn't be run unless the node
>> is removed from both pacemaker's and corosync's configuration. If you
>> actually are trying to remove the node completely, a better alternative
>> would be "pcs cluster node remove 0005B94238BC", which will handle all
>> of that for you. If you're not trying to remove the node completely,
>> then you shouldn't need this command at all.
>>
>> >
>> >
>> > *crm_mon output:*
>> > Defaulting to one-shot mode
>> > You need to have curses available at compile time to enable console mode
>> > Stack: corosync
>> > Current DC: 0005B9423910 (version 1.1.14-5a6cdd1) - partition with quorum
>> > Last updated: Tue May 16 12:21:27 2017 Last change: Tue May 16
>> > 12:21:26 2017 by root via cibadmin on 0005B94238BC
>> >
>> > 3 nodes and 2 resources configured
>> >
>> > Online: [ 0005B94238BC 0005B9423910 0005B9427C5A ]
>> >
>> >
>> > Observation: the remaining two resources, res2 and res1, were stopped
>> > and started.
>> >
>> >
>> > Regards,
>> > Aswathi
>> >
>> > On Mon, May 15, 2017 at 8:11 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
>> >
>> > On 05/15/2017 06:59 AM, Klaus Wenninger wrote:
>> > > On 05/15/2017 12:25 PM, Anu Pillai wrote:
>> > >> Hi Klaus,
>> > >>
>> > >> Please find attached cib.xml as well as corosync.conf.
>> >
>> > Maybe you're only setting this while testing, but having
>> > stonith-enabled=false and no-quorum-policy=ignore is highly dangerous in
>> > any kind of network split.
>> >
>> > FYI, default-action-timeout is deprecated in favor of setting a timeout
>> > in op_defaults, but it doesn't hurt anything.
>> >
>> > > Why wouldn't you keep placement-strategy at its default
>> > > to keep things simple? You aren't using any load-balancing
>> > > anyway, as far as I understood it.
>> >
>> > It looks like the intent is to use placement-strategy to limit each node
>> > to 1 resource. The configuration looks good for that.
>> >
>> > > Haven't used resource-stickiness=INF. No idea which strange
>> > > behavior that triggers. Try to have it just higher than what
>> > > the other scores might sum up to.
>> >
>> > Either way would be fine. Using INFINITY ensures that no other
>> > combination of scores will override it.
>> >
>> > > I might have overlooked something in your scores, but otherwise
>> > > there is nothing obvious to me.
>> > >
>> > > Regards,
>> > > Klaus
>> >
>> > I don't see anything obvious either. If you have logs around the time of
>> > the incident, that might help.
>> >
>> > >> Regards,
>> > >> Aswathi
>> > >>
>> > >> On Mon, May 15, 2017 at 2:46 PM, Klaus Wenninger <kwenning at redhat.com>
>> > >> wrote:
>> > >>
>> > >> On 05/15/2017 09:36 AM, Anu Pillai wrote:
>> > >> > Hi,
>> > >> >
>> > >> > We are running a pacemaker cluster for managing our resources. We
>> > >> > have 6 systems running 5 resources, and one is acting as standby.
>> > >> > We have a restriction that only one resource can run on one node.
>> > >> > But our observation is that whenever we add or delete a resource
>> > >> > from the cluster, all the remaining resources in the cluster are
>> > >> > stopped and started back.
>> > >> >
>> > >> > Can you please guide us on whether this is normal behavior or we
>> > >> > are missing any configuration that is leading to this issue?
>> > >>
>> > >> It should definitely be possible to prevent this behavior.
>> > >> If you share your config with us we might be able to
>> > >> track that down.
>> > >>
>> > >> Regards,
>> > >> Klaus
>> > >>
>> > >> >
>> > >> > Regards
>> > >> > Aswathi
>>
>
>