[Pacemaker] Recommendations in reducing failover response time

Mon Nov 5 08:17:50 EST 2012

On 05/11/2012 13:47, Arturo Borrero Gonzalez wrote:
> Hi there!
>
> I'm working with a pacemaker cluster that acts as the gateway of
> several vlans (subnets) of my network.
> So, all resources I'm managing are: virtual IPs (several), firewall,
> xorp, radvd, dhcp, etc. (all network infrastructure-related
> services). There are 26 resources, 21 being virtual IPs.

I have the exact same configuration: a two-node cluster of firewalls
with a group of VIPs colocated with everything else. So far, so good.
The complete failover is as fast as my slowest resource, which takes a
couple of seconds to start, but the VIP failover, which occurs first, is
almost instant.

>
> We have been using our current configuration for 3 or 4 months, and
> after some tests, we are not very happy with the implementation.
>
> The cluster is active-passive, with 2 nodes. Servers are sunfire amd64
> with 16 cores.
>
> I see mainly two issues, maybe both related:
>
> 1) The IPaddr2 and IPv6addr RAs are slow, or our configuration makes
> their response slow. So, moving resources from one node to the other
> takes very long (we don't have an exact time, but I think ~2-3 minutes).
> I suspect some kind of bad interaction between the firewall and those
> RAs, but we have checked pretty deeply.

This is definitely too long.
Do you have a valid DNS configuration on each node?
What do the logs say about this migration?

I remember a bug in the IPaddr RA, which depended on DNS for a
pipeline such as "route | grep" ...

Or maybe removing an alias from a bond simply takes a long time.

>
> 2) The constraints we add for colocation and grouping are not allowing
> the cluster to quickly reply to failover situations. We group all VIPs
> in one group and colocate the other resources where the VIP group is.
>
> Here is our current config in the production cluster:
>
> node node1
> node node2
> primitive p_dhcp lsb:/etc/init.d/isc-dhcp-server \
>          op monitor interval="5"
> primitive p_firewall lsb:/etc/init.d/firewall \
>          op monitor interval="20" timeout="5" onfail="restart" \
>          op start interval="0" timeout="50" \
>          op stop interval="0" timeout="20"
> primitive p_ipv ocf:heartbeat:IPaddr2 \
>          params ip="10.0.3.50" nic="bond0"
> primitive p_ipv_nat ocf:heartbeat:IPaddr2 \
>          params ip="10.0.3.57" nic="bond1.27"
> primitive p_ipv_openvpn ocf:heartbeat:IPaddr2 \
>          params ip="10.0.3.51" nic="bond0"
> primitive p_ipv_v6 ocf:heartbeat:IPv6addr \
>          params ipv6addr="fc00:18::9" nic="bond0"
> primitive p_ipv_v6_vlan27 ocf:heartbeat:IPv6addr \
>          params ipv6addr="fc00:27::1" cidr_netmask="64" nic="bond1.27"
> primitive p_ipv_v6_vlan31 ocf:heartbeat:IPv6addr \
>          params ipv6addr="fc00:31::1" cidr_netmask="64" nic="bond1.31"
> primitive p_ipv_v6_vlan34 ocf:heartbeat:IPv6addr \
>          params ipv6addr="fc00:34::1" cidr_netmask="64" nic="bond1.34"
> primitive p_ipv_v6_vlan51 ocf:heartbeat:IPv6addr \
>          params ipv6addr="fc00:51::1" cidr_netmask="64" nic="bond1.51"
> primitive p_ipv_v6_vlan54 ocf:heartbeat:IPv6addr \
>          params ipv6addr="fc00:54::1" cidr_netmask="64" nic="bond1.54"
> primitive p_ipv_v6_vlan6 ocf:heartbeat:IPv6addr \
>          params ipv6addr="fc00:6::1" cidr_netmask="64" nic="bond1.6"
> primitive p_ipv_v6_vlan7 ocf:heartbeat:IPv6addr \
>          params ipv6addr="fc00:7::1" cidr_netmask="64" nic="bond1.7"
> primitive p_ipv_vlan10 ocf:heartbeat:IPaddr2 \
>          params ip="10.0.3.65" nic="bond1.10"
> primitive p_ipv_vlan23 ocf:heartbeat:IPaddr2 \
>          params ip="10.0.2.193" nic="bond1.23"
> primitive p_ipv_vlan27 ocf:heartbeat:IPaddr2 \
>          params ip="10.0.4.129" nic="bond1.27"
> primitive p_ipv_vlan28 ocf:heartbeat:IPaddr2 \
>          params ip="10.0.3.9" nic="bond1.28"
> primitive p_ipv_vlan31 ocf:heartbeat:IPaddr2 \
>          params ip="10.0.8.129" nic="bond1.31"
> primitive p_ipv_vlan34 ocf:heartbeat:IPaddr2 \
>          params ip="10.0.5.1" nic="bond1.34"
> primitive p_ipv_vlan51 ocf:heartbeat:IPaddr2 \
>          params ip="10.0.4.1" nic="bond1.51"
> primitive p_ipv_vlan54 ocf:heartbeat:IPaddr2 \
>          params ip="10.0.3.193" nic="bond1.54"
> primitive p_ipv_vlan6 ocf:heartbeat:IPaddr2 \
>          params ip="10.0.5.1" nic="bond1.6"
> primitive p_ipv_vlan7 ocf:heartbeat:IPaddr2 \
>          params ip="10.0.2.1" nic="bond1.7"
> primitive p_openvpn lsb:/etc/init.d/openvpn \
>          op monitor interval="5"
> primitive p_radvd lsb:/etc/init.d/radvd \
>          op monitor interval="5"
> primitive p_xorp lsb:/etc/init.d/xorp \
>          op monitor interval="5"
> group g_ipv p_ipv_vlan27 p_ipv p_ipv_vlan7 p_ipv_vlan6 p_ipv_vlan54
> p_ipv_vlan51 p_ipv_vlan31 p_ipv_vlan34 p_ipv_nat p_ipv_vlan23
> p_ipv_vlan10 p_ipv_vlan28 p_ipv_openvpn p_ipv_v6 p_ipv_v6_vlan51
> p_ipv_v6_vlan31 p_ipv_v6_vlan6 p_ipv_v6_vlan7 p_ipv_v6_vlan27
> p_ipv_v6_vlan54 p_ipv_v6_vlan34
> colocation dhcp-ipv inf: p_dhcp g_ipv
> colocation firewall-ipv inf: p_firewall g_ipv
> colocation openvpn-ipv inf: p_openvpn p_ipv
> colocation radvd-ipv inf: p_radvd g_ipv
> colocation xorp-ipv inf: p_xorp g_ipv
> property $id="cib-bootstrap-options" \
>          dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>          cluster-infrastructure="openais" \
>          expected-quorum-votes="2" \
>          stonith-enabled="false" \
>          no-quorum-policy="ignore" \
>          last-lrm-refresh="1352111523"
> rsc_defaults $id="rsc-options" \
>          resource-stickiness="1000000"
>
>

I think these colocation constraints are missing ordering constraints.
For security reasons, you should make sure that the VIPs are created
first, then the firewall is started, and finally everything else;
otherwise openvpn or radvd could start /before/ the firewall.

Colocation doesn't imply ordering. (I had a hard time understanding that
one...)

In my configuration, I aggregated all that stuff into one colocation set
+ one ordering set. Maybe you could check whether that approach suits
you better and/or helps with the duration of the failover.

E.g.:

group IPHA eth0.10HA eth0.11HA eth0.12HA eth0.13HA eth0.14HA eth0.15HA 
eth0.16HA eth0.18HA eth0.19HA eth0.20HA eth0.21HA eth0.22HA eth0.23-1HA 
eth0.23-254HA eth0.24HA eth0.26HA eth0.2HA eth0.3-10HA eth0.3-15HA 
eth0.3-30HA eth0.4HA eth0.5-10HA eth0.5-215HA eth0.5-230HA eth0.7HA 
eth0.8HA eth0HA eth1-2HA eth1-3HA eth1-pubHA eth1.2HA eth1.97HA 
eth1.98HA eth1.99HA eth0.27HA eth0.28HA eth0.29HA eth0.30HA \
         meta target-role="Started" \
         meta globally-unique="false" target-role="Started"

[...]

colocation c_foo inf: ( bind ldirectord ldirectordBDD 
ldirectordMasterSlave openvpn stunnel ) IPHA firewall
order o_foo inf: IPHA firewall ( bind ldirectord ldirectordBDD 
ldirectordMasterSlave openvpn stunnel )
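
Transposed to the resource names from your configuration, the same idea
might look roughly like the fragment below. This is an untested sketch
that simply mirrors the two constraints above: the IDs c_gateway and
o_gateway are made up, it assumes the five existing two-resource
colocations (dhcp-ipv, firewall-ipv, openvpn-ipv, radvd-ipv, xorp-ipv)
are removed first, and it ties p_openvpn to the whole g_ipv group rather
than to p_ipv alone.

```
colocation c_gateway inf: ( p_dhcp p_radvd p_xorp p_openvpn ) g_ipv p_firewall
order o_gateway inf: g_ipv p_firewall ( p_dhcp p_radvd p_xorp p_openvpn )
```

The intent is that the VIP group starts first, then the firewall, and
only then the remaining services, in no particular order among
themselves. Check the result (for example with crm_verify -L) before
relying on it in production.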

Well, overall, sorry I didn't really help you; I just wanted to
highlight some configuration tweaks, as I run almost the same cluster.

Cheers.

-- 
Cheers,
Florian Crouzat