[ClusterLabs] Help required for N+1 redundancy setup

Nikhil Utane nikhil.subscribed at gmail.com
Thu Mar 17 00:00:10 EDT 2016


Thanks, Ken, for the detailed response.
I suppose I could even use some of the pcs/crm CLI commands then.
Cheers.

On Wed, Mar 16, 2016 at 8:27 PM, Ken Gaillot <kgaillot at redhat.com> wrote:

> On 03/16/2016 05:22 AM, Nikhil Utane wrote:
> > I see the following info gets updated in the CIB. Can I use this, or
> > is there a better way?
> >
> > <node_state id="node1" uname="node1" in_ccm="false" crmd="offline"
> > crm-debug-origin="peer_update_callback" join="down" expected="member">
>
> in_ccm/crmd/join reflect the current state of the node (as known by the
> partition whose CIB you're looking at), so if the node went down and
> came back up, they won't tell you anything about it having been down.
>
> - in_ccm indicates that the node is part of the underlying cluster layer
> (heartbeat/cman/corosync)
>
> - crmd indicates that the node is communicating at the pacemaker layer
>
> - join indicates what phase of the join process the node is at
>
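> For example, you can inspect these fields directly with standard
> Pacemaker tools (a sketch):
>
>   cibadmin --query --xpath "//node_state"
>   crm_mon -1
>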
> There's no direct way to see which node went down after the fact.
> There are indirect ways, however:
>
> - if the node was running resources, those will have failed, and those
> failures (including the node name) will be shown in the cluster status
>
> - the logs show all node membership events; you can search for logs such
> as "state is now lost" and "left us"
>
> - "stonith -H $NODE_NAME" will show the fence history for a given node,
> so if the node went down due to fencing, it will show up there
>
> - you can configure an ocf:pacemaker:ClusterMon resource to run crm_mon
> periodically and invoke a script on node events; you can write the
> script to do whatever you want (email you, etc.). In the upcoming
> 1.1.15 release, built-in notifications will make this more reliable and
> easier, but any script you use with ClusterMon will still be usable
> with the new method.
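>
> For example, a ClusterMon clone could be configured like this (a
> sketch assuming pcs; the script path is illustrative):
>
>   pcs resource create NodeEventMon ocf:pacemaker:ClusterMon \
>       extra_options="-E /usr/local/bin/node_event.sh" --clone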
>
> > On Wed, Mar 16, 2016 at 12:40 PM, Nikhil Utane <
> nikhil.subscribed at gmail.com>
> > wrote:
> >
> >> Hi Ken,
> >>
> >> Sorry about the long delay. This activity was deprioritized, but now
> >> it's back on track.
> >>
> >> One part of my question is still unanswered: on the newly active
> >> node, how do I find out which node went down?
> >> Is anything updated in the status section that can be read to figure
> >> this out?
> >>
> >> Thanks.
> >> Nikhil
> >>
> >> On Sat, Jan 9, 2016 at 3:31 AM, Ken Gaillot <kgaillot at redhat.com>
> wrote:
> >>
> >>> On 01/08/2016 11:13 AM, Nikhil Utane wrote:
> >>>>> I think stickiness will do what you want here. Set a stickiness
> >>>>> higher than the original node's preference, and the resource will
> >>>>> want to stay where it is.
> >>>>
> >>>> Not sure I understand this. Stickiness will ensure that resources
> >>>> don't move back when the original node comes back up, won't it?
> >>>> But in my case, I want the newly standby node to become the backup
> >>>> node for all the other nodes, i.e. it should now be able to run all
> >>>> my resource groups, albeit with a lower score. How do I achieve
> >>>> that?
> >>>
> >>> Oh right. I forgot to ask whether you had an opt-out
> >>> (symmetric-cluster=true, the default) or opt-in
> >>> (symmetric-cluster=false) cluster. If you're opt-out, every node can
> >>> run every resource unless you give it a negative preference.
> >>>
> >>> Partly it depends on whether there is a good reason to give each
> >>> instance a "home" node. Often, there's not. If you just want to balance
> >>> resources across nodes, the cluster will do that by default.
> >>>
> >>> If you prefer to put certain resources on certain nodes because the
> >>> resources require more physical resources (RAM/CPU/whatever), you can
> >>> set node attributes for that and use rules to set node preferences.
> >>>
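> >>> For example (a sketch; the attribute name and values are
> >>> illustrative):
> >>>
> >>>   crm_attribute --type nodes --node node1 --name hw_class --update big
> >>>   crm configure location MyGroup1-hw MyGroup1 rule 500: hw_class eq big
> >>>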
> >>> Either way, you can decide whether you want stickiness with it.
> >>>
> >>>> Also, can you tell me how to get, inside the OCF agent, the names
> >>>> of the node that goes active and the node that went down? Do I need
> >>>> to use notifications, or is a simpler alternative available?
> >>>> Thanks.
> >>>>
> >>>>
> >>>> On Fri, Jan 8, 2016 at 9:30 PM, Ken Gaillot <kgaillot at redhat.com>
> >>> wrote:
> >>>>
> >>>>> On 01/08/2016 06:55 AM, Nikhil Utane wrote:
> >>>>>> I would like to validate my final config.
> >>>>>>
> >>>>>> As I mentioned earlier, I will have (up to) 5 active servers and 1
> >>>>>> standby server.
> >>>>>> The standby server should take over the role of an active server
> >>>>>> that goes down. Each active server has some unique configuration
> >>>>>> that needs to be preserved.
> >>>>>>
> >>>>>> 1) So I will create 5 groups in total. Each group has an
> >>>>>> ocf:heartbeat:IPaddr2 resource (for the virtual IP) and my custom
> >>>>>> resource.
> >>>>>> 2) The virtual IP needs to be read inside my custom OCF agent, so
> >>>>>> I will make use of an attribute reference pointing to the IPaddr2
> >>>>>> value inside my custom resource, to avoid duplication.
> >>>>>> 3) I will then configure location constraints to run each group
> >>>>>> resource on its active node with a higher score, and on the standby
> >>>>>> node with a lower score (see the sketch after the table).
> >>>>>> For example:
> >>>>>> Group              Node            Score
> >>>>>> ---------------------------------------------
> >>>>>> MyGroup1        node1           500
> >>>>>> MyGroup1        node6           0
> >>>>>>
> >>>>>> MyGroup2        node2           500
> >>>>>> MyGroup2        node6           0
> >>>>>> ..
> >>>>>> MyGroup5        node5           500
> >>>>>> MyGroup5        node6           0
> >>>>>>
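> >>>>>> As a sketch of steps 1-3 in crm shell syntax (the IP address and
> >>>>>> agent names are illustrative):
> >>>>>>
> >>>>>>   crm configure primitive VIP1 ocf:heartbeat:IPaddr2 \
> >>>>>>       params ip=10.0.0.101 cidr_netmask=24 op monitor interval=30s
> >>>>>>   crm configure primitive MyRes1 ocf:custom:myagent
> >>>>>>   crm configure group MyGroup1 VIP1 MyRes1
> >>>>>>   crm configure location MyGroup1-active MyGroup1 500: node1
> >>>>>>   crm configure location MyGroup1-standby MyGroup1 0: node6
> >>>>>>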
> >>>>>> 4) Now if, say, node1 were to go down, the stop action on node1
> >>>>>> will get called first. I haven't decided if I need to do anything
> >>>>>> specific here.
> >>>>>
> >>>>> To clarify, if node1 goes down intentionally (e.g. standby or stop),
> >>>>> then all resources on it will be stopped first. But if node1 becomes
> >>>>> unavailable (e.g. crash or communication outage), it will get fenced.
> >>>>>
> >>>>>> 5) But when the start action on node 6 gets called, I will use the
> >>>>>> crm command-line interface to modify the above config, swapping
> >>>>>> node 1 and node 6, i.e.:
> >>>>>> MyGroup1        node6           500
> >>>>>> MyGroup1        node1           0
> >>>>>>
> >>>>>> MyGroup2        node2           500
> >>>>>> MyGroup2        node1           0
> >>>>>>
> >>>>>> 6) To do the above, I need the newly active and newly standby node
> >>>>>> names to be passed to my start action. What's the best way to get
> >>>>>> this information inside my OCF agent?
> >>>>>
> >>>>> Modifying the configuration from within an agent is dangerous -- too
> >>>>> much potential for feedback loops between pacemaker and the agent.
> >>>>>
> >>>>> I think stickiness will do what you want here. Set a stickiness
> >>>>> higher than the original node's preference, and the resource will
> >>>>> want to stay where it is.
> >>>>>
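> >>>>> For example (a sketch in crm shell), a default stickiness higher
> >>>>> than the 500-point node preferences keeps resources where they land:
> >>>>>
> >>>>>   crm configure rsc_defaults resource-stickiness=600
> >>>>>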
> >>>>>> 7) Apart from the node name, there will be other information that
> >>>>>> I plan to pass by making use of node attributes. What's the best
> >>>>>> way to get this information inside my OCF agent? Use a crm command
> >>>>>> to query?
> >>>>>
> >>>>> Any of the command-line interfaces for doing so should be fine,
> >>>>> but I'd recommend using one of the lower-level tools (crm_attribute
> >>>>> or attrd_updater) so you don't have a dependency on a higher-level
> >>>>> tool that may not always be installed.
> >>>>>
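> >>>>> For example, from within the agent (a sketch; the attribute name is
> >>>>> illustrative):
> >>>>>
> >>>>>   value=$(crm_attribute --node "$(crm_node -n)" \
> >>>>>           --name my_config --query --quiet)
> >>>>>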
> >>>>>> Thank You.
> >>>>>>
> >>>>>> On Tue, Dec 22, 2015 at 9:44 PM, Nikhil Utane <
> >>>>> nikhil.subscribed at gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Thanks to you, Ken, for giving all the pointers.
> >>>>>>> Yes, I can use service start/stop, which should be a lot simpler.
> >>>>>>> Thanks again. :)
> >>>>>>>
> >>>>>>> On Tue, Dec 22, 2015 at 9:29 PM, Ken Gaillot <kgaillot at redhat.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> On 12/22/2015 12:17 AM, Nikhil Utane wrote:
> >>>>>>>>> I have prepared a write-up explaining my requirements and the
> >>>>>>>>> solution I am proposing based on my understanding so far.
> >>>>>>>>> Kindly let me know whether what I am proposing is good or there
> >>>>>>>>> is a better way to achieve the same.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> https://drive.google.com/file/d/0B0zPvL-Tp-JSTEJpcUFTanhsNzQ/view?usp=sharing
> >>>>>>>>>
> >>>>>>>>> Let me know if you face any issues in accessing the above link.
> >>>>>>>>> Thanks.
> >>>>>>>>
> >>>>>>>> This looks great. Very well thought-out.
> >>>>>>>>
> >>>>>>>> One comment:
> >>>>>>>>
> >>>>>>>> "8. In the event of any failover, the standby node will get
> notified
> >>>>>>>> through an event and it will execute a script that will read the
> >>>>>>>> configuration specific to the node that went down (again using
> >>>>>>>> crm_attribute) and become active."
> >>>>>>>>
> >>>>>>>> It may not be necessary to use the notifications for this.
> >>>>>>>> Pacemaker will call your resource agent with the "start" action
> >>>>>>>> on the standby node, after ensuring it is stopped on the previous
> >>>>>>>> node. Hopefully the resource agent's start action has (or can
> >>>>>>>> have, with configuration options) all the information you need.
> >>>>>>>>
> >>>>>>>> If you do end up needing notifications, be aware that the
> >>>>>>>> feature will be disabled by default in the 1.1.14 release,
> >>>>>>>> because changes in syntax are expected in further development.
> >>>>>>>> You can define a compile-time constant to enable them.
>
>