[ClusterLabs] ClusterIP won't return to recovered node

Thu May 25 17:33:47 EDT 2017

On 05/24/2017 12:27 PM, Dan Ragle wrote:
> I suspect this has been asked before, and I apologize if so; a Google
> search didn't turn up anything that was helpful to me ...
> 
> I'm setting up an active/active two-node cluster and am having an issue
> where one of my two defined clusterIPs will not return to the other node
> after it (the other node) has been recovered.
> 
> I'm running on CentOS 7.3. My resource setups look like this:
> 
> # cibadmin -Q|grep dc-version
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.15-11.el7_3.4-e174ec8"/>
> 
> # pcs resource show PublicIP-clone
>  Clone: PublicIP-clone
>   Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true interleave=true
>   Resource: PublicIP (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=75.144.71.38 cidr_netmask=24 nic=bond0
>    Meta Attrs: resource-stickiness=0
>    Operations: start interval=0s timeout=20s (PublicIP-start-interval-0s)
>                stop interval=0s timeout=20s (PublicIP-stop-interval-0s)
>                monitor interval=30s (PublicIP-monitor-interval-30s)
> 
> # pcs resource show PrivateIP-clone
>  Clone: PrivateIP-clone
>   Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true interleave=true
>   Resource: PrivateIP (class=ocf provider=heartbeat type=IPaddr2)
>    Attributes: ip=192.168.1.3 nic=bond1 cidr_netmask=24
>    Meta Attrs: resource-stickiness=0
>    Operations: start interval=0s timeout=20s (PrivateIP-start-interval-0s)
>                stop interval=0s timeout=20s (PrivateIP-stop-interval-0s)
>                monitor interval=10s timeout=20s (PrivateIP-monitor-interval-10s)
> 
> # pcs constraint --full | grep -i publicip
>   start WEB-clone then start PublicIP-clone (kind:Mandatory) (id:order-WEB-clone-PublicIP-clone-mandatory)
> # pcs constraint --full | grep -i privateip
>   start WEB-clone then start PrivateIP-clone (kind:Mandatory) (id:order-WEB-clone-PrivateIP-clone-mandatory)

FYI, these constraints cover ordering only. If you also want to be sure
that the IPs start only on a node where the web service is functional,
then you also need colocation constraints.
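
As a rough sketch of what that could look like with pcs, assuming the
resource names shown above. The INFINITY score, which makes the
colocation mandatory, is an assumption about what is wanted here; a
finite score would express a preference instead.

```shell
# Keep each IP clone on a node where WEB-clone is running.
# INFINITY makes the colocation mandatory (assumed to be the intent).
pcs constraint colocation add PublicIP-clone with WEB-clone INFINITY
pcs constraint colocation add PrivateIP-clone with WEB-clone INFINITY

# Review the resulting constraints.
pcs constraint --full
```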

> 
> When I first create the resources, they split across the two nodes as
> expected/desired:
> 
>  Clone Set: PublicIP-clone [PublicIP] (unique)
>      PublicIP:0        (ocf::heartbeat:IPaddr2):       Started node1-pcs
>      PublicIP:1        (ocf::heartbeat:IPaddr2):       Started node2-pcs
>  Clone Set: PrivateIP-clone [PrivateIP] (unique)
>      PrivateIP:0        (ocf::heartbeat:IPaddr2):       Started node1-pcs
>      PrivateIP:1        (ocf::heartbeat:IPaddr2):       Started node2-pcs
>  Clone Set: WEB-clone [WEB]
>      Started: [ node1-pcs node2-pcs ]
> 
> I then put the second node in standby:
> 
> # pcs node standby node2-pcs
> 
> And the IPs both jump to node1 as expected:
> 
>  Clone Set: PublicIP-clone [PublicIP] (unique)
>      PublicIP:0        (ocf::heartbeat:IPaddr2):       Started node1-pcs
>      PublicIP:1        (ocf::heartbeat:IPaddr2):       Started node1-pcs
>  Clone Set: WEB-clone [WEB]
>      Started: [ node1-pcs ]
>      Stopped: [ node2-pcs ]
>  Clone Set: PrivateIP-clone [PrivateIP] (unique)
>      PrivateIP:0        (ocf::heartbeat:IPaddr2):       Started node1-pcs
>      PrivateIP:1        (ocf::heartbeat:IPaddr2):       Started node1-pcs
> 
> Then I take the second node out of standby:
> 
> # pcs node unstandby node2-pcs
> 
> The PublicIP goes back, but the PrivateIP does not:
> 
>  Clone Set: PublicIP-clone [PublicIP] (unique)
>      PublicIP:0        (ocf::heartbeat:IPaddr2):       Started node1-pcs
>      PublicIP:1        (ocf::heartbeat:IPaddr2):       Started node2-pcs
>  Clone Set: WEB-clone [WEB]
>      Started: [ node1-pcs node2-pcs ]
>  Clone Set: PrivateIP-clone [PrivateIP] (unique)
>      PrivateIP:0        (ocf::heartbeat:IPaddr2):       Started node1-pcs
>      PrivateIP:1        (ocf::heartbeat:IPaddr2):       Started node1-pcs
> 
> Anybody see what I'm doing wrong? I'm not seeing anything in the logs to
> indicate that it tries node2 and then fails; but I'm fairly new to the
> software so it's possible I'm not looking in the right place.

The output of "pcs status" would show any failed actions, and anything
important in the logs would be tagged "error:" or "warning:".
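
A small helper for pulling those lines out of a log might look like the
sketch below. The function name is made up for illustration, and the log
path in the usage comment is an assumption; use whatever file your
cluster actually logs to.

```shell
# Print every line in a log file that is tagged "error:" or "warning:".
# Usage (path is an assumption, adjust for your system):
#   cluster_log_problems /var/log/cluster/corosync.log
cluster_log_problems() {
    grep -E '(error|warning):' "$1"
}
```

Running it on both nodes for the period around the unstandby should show
whether anything was reported at all.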

At any given time, one of the nodes is the DC (designated controller),
meaning it schedules actions for the whole cluster. That node will have
more "pengine:" messages in its logs at the time. You can check those
logs to see what decisions were made, and look for a "saving inputs"
message, which names the file holding the cluster state that was used to
make those decisions. You can run the crm_simulate tool on that file to
get more information.
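
One way to find those files is sketched below. The helper name is
hypothetical, and the pattern only assumes that the log line contains
"saving inputs" followed by a path ending in .bz2; the exact message
wording and the log location can differ between versions.

```shell
# List the scheduler input files named in "saving inputs" log messages.
# Usage (log path is an assumption):
#   pe_input_files /var/log/cluster/corosync.log
pe_input_files() {
    grep 'saving inputs' "$1" | grep -oE '/[^[:space:]]+\.bz2'
}

# Then, on the DC, replay one of the listed files and show the scores
# that led to each placement decision (FILE is a placeholder):
#   crm_simulate --simulate --show-scores --xml-file FILE
```

The scores in the crm_simulate output are usually the quickest way to see
why one node was preferred over another for a given instance.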

By default, pacemaker will try to balance the number of resources
running on each node, so I'm not sure why in this case node1 has four
resources and node2 has two. crm_simulate might help explain it.

However, there's nothing here telling pacemaker that the instances of
PrivateIP should run on different nodes when possible. With your
existing constraints, pacemaker would be equally happy to run both
PublicIP instances on one node and both PrivateIP instances on the other
node.

I think you could probably get what you want by putting an optional
colocation preference (a score less than INFINITY) between PrivateIP and
PublicIP. The only way pacemaker could fully satisfy that would be to run
one instance of each on each node.
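
A minimal sketch of such a preference with pcs, untested against this
configuration. The score of 100 is an arbitrary illustrative value; any
finite score keeps the colocation optional.

```shell
# Prefer, but do not require, PrivateIP instances to run alongside PublicIP.
# The score 100 is an arbitrary assumption; anything below INFINITY is optional.
pcs constraint colocation add PrivateIP-clone with PublicIP-clone 100

# Show the allocation scores the live cluster computes after the change.
crm_simulate --live-check --show-scores
```

Repeating the standby/unstandby test afterwards would show whether
PrivateIP:1 now moves back to node2.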

> Also, I noticed when putting a node in standby the main NIC appears to
> be interrupted momentarily (long enough for my SSH session, which is
> connected via the permanent IP on the NIC and not the clusterIP, to be
> dropped). Is there any way to avoid this? I was thinking that the
> cluster operations would only affect the ClusterIP and not the other IPs
> being served on that NIC.

Nothing in the cluster should cause that behavior. Check all the system
logs around that time to see if anything unusual is reported.
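
On CentOS 7 the systemd journal is one place to look. The sketch below
uses a made-up time window; substitute the time at which the node was
actually put in standby.

```shell
# All journal entries in a ten-minute window (timestamps are placeholders).
journalctl --since "2017-05-24 12:20:00" --until "2017-05-24 12:30:00"

# Kernel messages only, which is where NIC and bonding link changes show up.
journalctl -k --since "2017-05-24 12:20:00" --until "2017-05-24 12:30:00"
```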

> 
> Thanks!
> 
> Dan