[Pacemaker] create 2-node Active/Passive firewall cluster

David Lang david at lang.hm
Thu Sep 19 14:04:23 UTC 2013


On Thu, 19 Sep 2013, Florian Crouzat wrote:

> On 19/09/2013 11:43, David Lang wrote:
>> 
>> I've been running active/failover firewall clusters with Heartbeat since
>> about 2000, and I have one suggestion to make. If you can leave all
>> the daemons running all the time, the failover process is far more
>> robust (and faster, since you don't have daemons to start). If you set
>> net.ipv4.ip_nonlocal_bind, you can even have the daemons start up bound
>> to VIP addresses that don't exist yet.
>> 
>> If you do not have to have the daemons bound to the VIP, the fact that
>> they are always running on the backup box gives you a quick way to check
>> if a failover would solve the problem or not by having a client connect
>> directly to the second box. The drawback is that someone may configure
>> something to point directly at a box and not at a VIP and you won't
>> detect it (without log analysis) until the box they point at actually
>> goes down.
>> 
>> David Lang
>
> I never thought about that; it seems like it could be interesting, especially
> with slow (start|stop)ing daemons such as squid.

Yes, and if the daemons are started at boot time, you don't have to worry about
some subtle config error creeping in that prevents them from running when you
need them.
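If you follow the quoted ip_nonlocal_bind suggestion, a small preflight check in the boot script can confirm the sysctl is actually set before the VIP-bound daemons start. This is a Python sketch that assumes a Linux host, where net.ipv4.ip_nonlocal_bind is exposed as /proc/sys/net/ipv4/ip_nonlocal_bind; the function names are made up for illustration.

```python
from pathlib import Path

# Linux exposes the sysctl net.ipv4.ip_nonlocal_bind at this path.
SYSCTL_PATH = Path("/proc/sys/net/ipv4/ip_nonlocal_bind")


def parse_nonlocal_bind(text: str) -> bool:
    """Interpret the sysctl file contents ("0" or "1" plus a newline)."""
    return text.strip() == "1"


def nonlocal_bind_enabled(path: Path = SYSCTL_PATH) -> bool:
    """Return True if daemons may bind to addresses not (yet) on an interface."""
    return parse_nonlocal_bind(path.read_text())
```

A boot script could call `nonlocal_bind_enabled()` and refuse to start the VIP-bound daemons (with a clear error) when it returns False, rather than letting them fail to bind later.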

You can also monitor the availability of the backup firewall from your network
monitoring systems. Nothing's worse than having your primary fail, only to
discover that your backup wasn't working (especially because of something like a
bad route, which the HA software won't detect since it only checks the local
subnet).
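One way to do this from a monitoring host is to probe each firewall's own address (not the VIP), so the standby is tested over the same routes clients would use. A minimal Python sketch, assuming the service being checked listens on TCP (for example a squid proxy port); node names and addresses are placeholders.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: treat all as "not available".
        return False


def check_nodes(nodes: dict[str, str], port: int) -> dict[str, bool]:
    """Probe every node's own address so the standby gets checked too."""
    return {name: port_open(addr, port) for name, addr in nodes.items()}
```

For example, `check_nodes({"fw-a": "192.0.2.11", "fw-b": "192.0.2.12"}, 3128)` (hypothetical addresses and port) would report each box separately, so a broken standby shows up before you need it.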


> In my case, my daemons would be protected by the "passive firewall state" 
> that my nodes have when they don't host resources.

Why? I know, the real answer is 'because it's the standby, and standby boxes
aren't active'. But is there really a need to do this, or is it just because?

If your systems are hardened as firewalls, what difference does it make whether
they are exposed or 'protected by the passive firewall state'?

What do you gain by changing your firewall rules when you switch between active
and passive? And are you sure there is never an instant during the switch when
your defenses are down? (At boot, I load the iptables rules before bringing up
the interfaces.)
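The boot ordering described above can be expressed as a fixed sequence in which the rule load must succeed before any interface comes up. A Python sketch, assuming rules saved in iptables-restore format at a hypothetical path (/etc/iptables/rules.v4) and hypothetical interface names; by default it only prints the commands.

```python
import subprocess
from typing import Optional

# Hypothetical locations; adjust for your distribution.
RULES_FILE = "/etc/iptables/rules.v4"
INTERFACES = ["eth0", "eth1"]

Step = tuple[list[str], Optional[str]]  # (argv, file to feed on stdin or None)


def boot_steps(rules_file: str, interfaces: list[str]) -> list[Step]:
    """Return the bring-up sequence: firewall rules first, then the links."""
    steps: list[Step] = [(["iptables-restore"], rules_file)]
    steps += [(["ip", "link", "set", "dev", name, "up"], None) for name in interfaces]
    return steps


def run_steps(steps: list[Step], dry_run: bool = True) -> None:
    for argv, stdin_path in steps:
        if dry_run:
            suffix = f" < {stdin_path}" if stdin_path else ""
            print(" ".join(argv) + suffix)
            continue
        if stdin_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdin_path) as rules:
                # check=True raises on failure, so the loop stops and the
                # interfaces never come up without rules (fail closed).
                subprocess.run(argv, stdin=rules, check=True)
```

The design choice is that a failed rule load aborts the whole sequence, so the box stays unreachable rather than coming up unprotected.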

If having something running on the primary and backup at the same time would
cause a conflict (a shared disk or IP is a good example), then the HA software
needs to manage it. Otherwise it should be running at all times, so that you
know it's healthy (you can monitor it) and so there is less work to do at
failover time.
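The rule of thumb above amounts to a simple split: only resources that would conflict if two copies ran at once are handed to the cluster manager. A Python sketch; the `Service` type and the example list (built from resources mentioned in this thread) are illustrative and not a Pacemaker API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    # True if two copies running at once (one per box) would conflict.
    conflicts_if_duplicated: bool


def split_services(services: list[Service]) -> tuple[list[str], list[str]]:
    """Return (cluster_managed, always_running) service names."""
    managed = [s.name for s in services if s.conflicts_if_duplicated]
    always_on = [s.name for s in services if not s.conflicts_if_duplicated]
    return managed, always_on


# Examples from this thread; classify your own services the same way.
SERVICES = [
    Service("vip", True),          # only one box may own the address
    Service("shared-disk", True),  # only one box may mount it
    Service("squid", False),       # safe to run on both boxes
]
```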

You should have both systems sending their logs to a central server, so from the
point of view of knowing what's happening, there really shouldn't be a
difference between the two systems, even if someone does deliberately hit your
'backup' box.
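Usually the local syslog daemon is configured to forward to the central server, but helper scripts written in Python can do the same with the standard library's `SysLogHandler`. A sketch assuming a UDP syslog collector on port 514; the logger name and host are placeholders.

```python
import logging
import logging.handlers


def central_logger(name: str, host: str, port: int = 514) -> logging.Logger:
    """Return a logger that forwards records to a remote syslog collector over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # A (host, port) tuple makes SysLogHandler send UDP datagrams.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


# Hypothetical usage on each box, with the box's own name as the logger name:
# log = central_logger("fw-a", "loghost.example.com")
# log.info("took over firewall resources")
```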



And speaking of primary and backup: if the boxes are identical hardware, it
really shouldn't matter which is active, so 'primary' and 'backup' are bad
names. It's best practice to regularly exercise your backup systems, and having
your HA system treat the two as equal (except when both boot at the same time
or when recovering from split-brain, where you need to designate who wins the
tie) lets you run for an extended time on either box.
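The "treat both boxes as equal" policy boils down to: stay wherever the resources already are, and consult a fixed tie-break order only when there is no current holder. A Python sketch with hypothetical node names; in Pacemaker this roughly corresponds to relying on resource stickiness instead of a location preference for one node.

```python
from typing import Optional


def choose_active(current: Optional[str], healthy: set[str],
                  tie_break_order: list[str]) -> Optional[str]:
    """Pick the node that should host the firewall resources.

    current:         node that is active now, or None (both booted together,
                     or recovering from split-brain)
    healthy:         nodes that currently pass their health checks
    tie_break_order: fixed order consulted only when there is no usable holder
    """
    if current in healthy:
        return current  # stay put: neither box is the 'primary'
    for node in tie_break_order:
        if node in healthy:
            return node
    return None  # nothing healthy to run on
```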

This also helps you avoid flapping, where the primary has something wrong that
slows it down so it can't handle full load, but could handle partial load. Under
load the primary fails, you fail over to the backup, the primary recovers and
looks healthy, so you fail back to the primary, which goes down again because of
the load....
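A toy simulation shows the difference. It assumes a degraded node A that fails whenever it carries the load and looks healthy again as soon as it is idle, plus a healthy node B, and counts resource moves under automatic failback versus "stay where you are". It is an illustration only, not a model of any real cluster manager.

```python
def count_failovers(policy: str, ticks: int = 10) -> int:
    """Count resource moves over `ticks` health-check intervals.

    policy "prefer_a": fail back to A whenever it looks healthy again.
    policy "sticky":   stay on whichever node is active while it is healthy.
    """
    active = "A"
    moves = 0
    for _ in range(ticks):
        if active == "A":
            # Under load the degraded node A fails; resources move to B.
            active = "B"
            moves += 1
        elif policy == "prefer_a":
            # Idle A now passes its checks, so it is reclaimed as 'primary'.
            active = "A"
            moves += 1
    return moves
```

With the defaults, the failback policy moves the resources on every interval (10 moves), while the sticky policy moves them once and then stays on B.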

I've seen this be something as simple as blocked cooling, where a box was fine
when idle but overheated under load (and therefore automatically throttled its
CPU down to slower speeds).

Ideally you do something like schedule a failover every month or quarter from 
one box to the other, and just keep running on that box until the next failover.
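A fixed rotation is easy to compute, so anyone can tell which box is supposed to be active this period. A Python sketch assuming a schedule counted in whole months from a chosen start date; node names are placeholders, and the actual move would still be done with your cluster manager's own tools.

```python
from datetime import date


def scheduled_active(nodes: list[str], start: date, today: date,
                     months_per_rotation: int = 1) -> str:
    """Return the node that should be active on `today` under a fixed rotation."""
    months = (today.year - start.year) * 12 + (today.month - start.month)
    if months < 0:
        raise ValueError("today is before the rotation start")
    # Advance one node every `months_per_rotation` months, wrapping around.
    return nodes[(months // months_per_rotation) % len(nodes)]
```

For a quarterly schedule, pass `months_per_rotation=3`.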

It does mean that you need to check which box is active when you work on them, 
but you should do that anyway :-)
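On Linux, one quick check is whether the VIP is currently assigned to one of the box's interfaces. With net.ipv4.ip_nonlocal_bind enabled, a test bind to the VIP succeeds on both boxes, so a bind can't answer this; the address list can. A Python sketch parsing the output of `ip -o -4 addr show`; the VIP is a placeholder.

```python
import subprocess


def vip_is_local(vip: str, ip_output: str) -> bool:
    """Return True if `vip` is listed as an inet address in `ip -o -4 addr show` output."""
    for line in ip_output.splitlines():
        fields = line.split()
        if "inet" in fields:
            i = fields.index("inet")
            # The token after "inet" is address/prefix, e.g. 192.0.2.1/24.
            if i + 1 < len(fields) and fields[i + 1].split("/")[0] == vip:
                return True
    return False


def this_box_is_active(vip: str) -> bool:
    # Ask the kernel for the IPv4 addresses on all interfaces, one per line.
    out = subprocess.run(["ip", "-o", "-4", "addr", "show"],
                         capture_output=True, text=True, check=True).stdout
    return vip_is_local(vip, out)
```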

David Lang



