[ClusterLabs] Help required for N+1 redundancy setup

Ken Gaillot kgaillot at redhat.com
Wed Mar 16 10:57:53 EDT 2016


On 03/16/2016 05:22 AM, Nikhil Utane wrote:
> I see the following info gets updated in the CIB. Can I use this, or is
> there a better way?
> 
> <node_state id="*node1*" uname="node1" in_ccm="false" crmd="offline"
> crm-debug-origin="peer_update_callback" join="*down*" expected="member">

in_ccm/crmd/join reflect the current state of the node (as known by the
partition whose CIB you're looking at), so if the node went down and came
back up, they won't tell you anything about it having been down.

- in_ccm indicates that the node is part of the underlying cluster layer
(heartbeat/cman/corosync)

- crmd indicates that the node is communicating at the pacemaker layer

- join indicates what phase of the join process the node is at
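
For example, you can check the current values for a node straight from the
CIB or from crm_mon (a sketch; the node name is just an example):

  # dump the <node_state> entry for node1 from the live CIB
  cibadmin --query --xpath "//node_state[@uname='node1']"

  # one-shot overview of which nodes are currently online/offline
  crm_mon -1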

There's no direct way to see which node went down after the fact, but
there are several indirect ways (see the example commands after this list):

- if the node was running resources, those resources will be failed, and
those failures (including the node they were on) will be shown in the
cluster status

- the logs show all node membership events; you can search for messages
such as "state is now lost" and "left us"

- "stonith -H $NODE_NAME" will show the fence history for a given node,
so if the node went down due to fencing, it will show up there

- you can configure an ocf:pacemaker:ClusterMon resource to run crm_mon
periodically and call an external script on node events, and you can write
the script to do whatever you want (email you, etc.). In the upcoming
1.1.15 release, built-in notifications will make this more reliable and
easier, but any script you use with ClusterMon will still be usable with
the new method.
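
As a rough sketch of the last three (paths, node names, and the script are
only examples):

  # search the system log for membership-loss messages
  grep -e "state is now lost" -e "left us" /var/log/messages

  # show the fencing history for a given node
  stonith_admin --history node1

  # ClusterMon resource (usually cloned) that calls an external script
  crm configure primitive cluster-mon ocf:pacemaker:ClusterMon \
      params extra_options="-E /usr/local/bin/node_event.sh" \
      op monitor interval=10s

  # node_event.sh can then log/mail the event; crm_mon passes CRM_notify_*
  # environment variables describing it, e.g.:
  echo "$(date): ${CRM_notify_node} ${CRM_notify_task} ${CRM_notify_desc}" \
      >> /var/log/node_events.log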

> On Wed, Mar 16, 2016 at 12:40 PM, Nikhil Utane <nikhil.subscribed at gmail.com>
> wrote:
> 
>> Hi Ken,
>>
>> Sorry about the long delay. This activity was de-prioritized but now it's
>> back on track.
>>
>> One part of the question that is still not answered: on the newly active
>> node, how do I find out which node went down?
>> Is anything updated in the status section that can be read to figure
>> this out?
>>
>> Thanks.
>> Nikhil
>>
>> On Sat, Jan 9, 2016 at 3:31 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>>
>>> On 01/08/2016 11:13 AM, Nikhil Utane wrote:
>>>>> I think stickiness will do what you want here. Set a stickiness higher
>>>>> than the original node's preference, and the resource will want to stay
>>>>> where it is.
>>>>
>>>> Not sure I understand this. Stickiness will ensure that resources don't
>>>> move back when the original node comes back up, won't it?
>>>> But in my case, I want the newly standby node to become the backup node
>>>> for all other nodes, i.e. it should now be able to run all my resource
>>>> groups, albeit with a lower score. How do I achieve that?
>>>
>>> Oh right. I forgot to ask whether you had an opt-out
>>> (symmetric-cluster=true, the default) or opt-in
>>> (symmetric-cluster=false) cluster. If you're opt-out, every node can run
>>> every resource unless you give it a negative preference.
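>>>
>>> For reference, that is a cluster-wide property, e.g. with crmsh:
>>>
>>>   crm configure property symmetric-cluster=false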
>>>
>>> Partly it depends on whether there is a good reason to give each
>>> instance a "home" node. Often, there's not. If you just want to balance
>>> resources across nodes, the cluster will do that by default.
>>>
>>> If you prefer to put certain resources on certain nodes because the
>>> resources require more physical resources (RAM/CPU/whatever), you can
>>> set node attributes for that and use rules to set node preferences.
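>>>
>>> For example (just a sketch; the attribute name and value are made up):
>>>
>>>   # mark node1 as having extra capacity
>>>   crm_attribute --type nodes --node node1 --name capacity --update high
>>>
>>>   # prefer nodes with that attribute for a particular group
>>>   crm configure location MyGroup1-wants-capacity MyGroup1 \
>>>       rule 500: capacity eq high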
>>>
>>> Either way, you can decide whether you want stickiness with it.
>>>
>>>> Also, can you answer how to get, inside the OCF agent, the names of the
>>>> node that goes active and the node that went down? Do I need to use
>>>> notifications, or is a simpler alternative available?
>>>> Thanks.
>>>>
>>>>
>>>> On Fri, Jan 8, 2016 at 9:30 PM, Ken Gaillot <kgaillot at redhat.com>
>>> wrote:
>>>>
>>>>> On 01/08/2016 06:55 AM, Nikhil Utane wrote:
>>>>>> Would like to validate my final config.
>>>>>>
>>>>>> As I mentioned earlier, I will have (up to) 5 active servers and 1
>>>>>> standby server.
>>>>>> The standby server should take up the role of the active that went down.
>>>>>> Each active has some unique configuration that needs to be preserved.
>>>>>>
>>>>>> 1) So I will create 5 groups in total. Each group has an
>>>>>> ocf:heartbeat:IPaddr2 resource (for the virtual IP) and my custom
>>>>>> resource.
>>>>>> 2) The virtual IP needs to be read inside my custom OCF agent, so I will
>>>>>> make use of an attribute reference pointing to the value of IPaddr2
>>>>>> inside my custom resource, to avoid duplication.
>>>>>> 3) I will then configure location constraints to run each group resource
>>>>>> on its active node with a higher score and on the standby with a lower
>>>>>> score. For example (a crmsh sketch follows this table):
>>>>>> Group              Node            Score
>>>>>> ---------------------------------------------
>>>>>> MyGroup1        node1           500
>>>>>> MyGroup1        node6           0
>>>>>>
>>>>>> MyGroup2        node2           500
>>>>>> MyGroup2        node6           0
>>>>>> ..
>>>>>> MyGroup5        node5           500
>>>>>> MyGroup5        node6           0
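>>>>>>
>>>>>> In crmsh syntax this could look something like the following (the IP
>>>>>> address, custom agent name, and constraint IDs are only placeholders):
>>>>>>
>>>>>>   primitive vip1 ocf:heartbeat:IPaddr2 \
>>>>>>       params ip=10.0.0.101 cidr_netmask=24 op monitor interval=30s
>>>>>>   primitive app1 ocf:mycompany:MyApp
>>>>>>   group MyGroup1 vip1 app1
>>>>>>   location MyGroup1-on-node1 MyGroup1 500: node1
>>>>>>   location MyGroup1-on-node6 MyGroup1 0: node6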
>>>>>>
>>>>>> 4) Now if, say, node1 were to go down, then the stop action on node1
>>>>>> will get called first. I haven't decided if I need to do anything
>>>>>> specific here.
>>>>>
>>>>> To clarify, if node1 goes down intentionally (e.g. standby or stop),
>>>>> then all resources on it will be stopped first. But if node1 becomes
>>>>> unavailable (e.g. crash or communication outage), it will get fenced.
>>>>>
>>>>>> 5) But when the start action on node6 gets called, then using the crm
>>>>>> command-line interface, I will modify the above config to swap node1
>>>>>> and node6, i.e.:
>>>>>> MyGroup1        node6           500
>>>>>> MyGroup1        node1           0
>>>>>>
>>>>>> MyGroup2        node2           500
>>>>>> MyGroup2        node1           0
>>>>>>
>>>>>> 6) To do the above, I need the names of the newly active and newly
>>>>>> standby nodes to be passed to my start action. What's the best way to
>>>>>> get this information inside my OCF agent?
>>>>>
>>>>> Modifying the configuration from within an agent is dangerous -- too
>>>>> much potential for feedback loops between pacemaker and the agent.
>>>>>
>>>>> I think stickiness will do what you want here. Set a stickiness higher
>>>>> than the original node's preference, and the resource will want to stay
>>>>> where it is.
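>>>>>
>>>>> For example, something like (the value is arbitrary, just higher than
>>>>> the 500 node preference used above):
>>>>>
>>>>>   crm configure rsc_defaults resource-stickiness=1000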
>>>>>
>>>>>> 7) Apart from the node name, there will be other information which I
>>>>>> plan to pass by making use of node attributes. What's the best way to
>>>>>> get this information inside my OCF agent? Use a crm command to query?
>>>>>
>>>>> Any of the command-line interfaces for doing so should be fine, but I'd
>>>>> recommend using one of the lower-level tools (crm_attribute or
>>>>> attrd_updater) so you don't have a dependency on a higher-level tool
>>>>> that may not always be installed.
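>>>>>
>>>>> For example (the attribute name is only illustrative):
>>>>>
>>>>>   # permanent node attribute stored in the CIB nodes section
>>>>>   crm_attribute --type nodes --node node1 --name unit_config --query
>>>>>
>>>>>   # transient attribute handled by the attribute daemon
>>>>>   attrd_updater --node node1 --name unit_config --update value1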
>>>>>
>>>>>> Thank You.
>>>>>>
>>>>>> On Tue, Dec 22, 2015 at 9:44 PM, Nikhil Utane <
>>>>> nikhil.subscribed at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks to you Ken for giving all the pointers.
>>>>>>> Yes, I can use service start/stop, which should be a lot simpler.
>>>>>>> Thanks again. :)
>>>>>>>
>>>>>>> On Tue, Dec 22, 2015 at 9:29 PM, Ken Gaillot <kgaillot at redhat.com>
>>>>> wrote:
>>>>>>>
>>>>>>>> On 12/22/2015 12:17 AM, Nikhil Utane wrote:
>>>>>>>>> I have prepared a write-up explaining my requirements and the current
>>>>>>>>> solution that I am proposing based on my understanding so far.
>>>>>>>>> Kindly let me know if what I am proposing is good or there is a better
>>>>>>>>> way to achieve the same.
>>>>>>>>>
>>>>>>>>> https://drive.google.com/file/d/0B0zPvL-Tp-JSTEJpcUFTanhsNzQ/view?usp=sharing
>>>>>>>>>
>>>>>>>>> Let me know if you face any issue in accessing the above link. Thanks.
>>>>>>>>
>>>>>>>> This looks great. Very well thought-out.
>>>>>>>>
>>>>>>>> One comment:
>>>>>>>>
>>>>>>>> "8. In the event of any failover, the standby node will get notified
>>>>>>>> through an event and it will execute a script that will read the
>>>>>>>> configuration specific to the node that went down (again using
>>>>>>>> crm_attribute) and become active."
>>>>>>>>
>>>>>>>> It may not be necessary to use the notifications for this. Pacemaker
>>>>>>>> will call your resource agent with the "start" action on the standby
>>>>>>>> node, after ensuring it is stopped on the previous node. Hopefully the
>>>>>>>> resource agent's start action has (or can have, with configuration
>>>>>>>> options) all the information you need.
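>>>>>>>>
>>>>>>>> A very rough sketch of such a start action (the attribute name and
>>>>>>>> the surrounding agent details are hypothetical):
>>>>>>>>
>>>>>>>>   my_app_start() {
>>>>>>>>       local me cfg
>>>>>>>>       me=$(crm_node -n)   # name of the node we are starting on
>>>>>>>>       # read this node's settings from a node attribute
>>>>>>>>       cfg=$(crm_attribute --type nodes --node "$me" \
>>>>>>>>                 --name unit_config --query --quiet)
>>>>>>>>       ocf_log info "starting with config: $cfg"
>>>>>>>>       # ... start the service using $cfg ...
>>>>>>>>       return $OCF_SUCCESS
>>>>>>>>   }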
>>>>>>>>
>>>>>>>> If you do end up needing notifications, be aware that the feature will
>>>>>>>> be disabled by default in the 1.1.14 release, because changes in syntax
>>>>>>>> are expected in further development. You can define a compile-time
>>>>>>>> constant to enable them.




