[Pacemaker] How to speed up failover on node failure and network outage

Fri Feb 18 08:45:13 EST 2011

Hello *,

I have an interesting problem at a customer installation site:

1. The failover on node failure (unplugging the power cords) takes about 20s.
2. The failover on network outage (unplugging the network cable of the active 
node) takes about 40s.

The setup is as follows:

heartbeat 3.0.3 from Debian "Squeeze"
pacemaker 1.0.9.1 from Debian "Squeeze"

- 2 nodes, two network connections between the nodes for hb
- a drbd master-slave
- a group, started on the drbd master, with following components:
  * a filesystem (on the drbd)
  * an IP address
  * a postgresql database
  * two LSB scripts
- an ocf:pacemaker:ping clone on both nodes to detect network outages

A failover time of about 2-3s for both node and network failure is required by 
the customer.

This requirement stems from the previous setup, before drbd, postgresql etc. 
were added:

A heartbeat-2 setup with one group, containing only one IP address and an LSB 
script, with a single network connection between the nodes and no pingd/ipfail 
setup. The deadtime was set to 2s, so the cluster would indeed fail over within 
2-3s on node failure. A network outage would have caused a split-brain 
situation, with the standby node going active within 2-3s.

Now, with drbd in place, abusing the split-brain situation this way is out of 
the question, but the fast failover time is still required.

Is it possible to substantially speed up the failover times?

Basically, I am looking for one of the following answers:

1. It is possible to get the times down, by tuning the configuration or by 
using some patches from hg (I noticed a lot of "speedup enhancements" in 
pacemaker 1.2)

2. It could be done, but some development work is required - my customer is 
willing to pay for development work on this issue.

3. It is not possible with the way heartbeat/pacemaker currently works 
internally.
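
To illustrate the kind of configuration tuning meant in option 1, the ping 
resource from the attached configuration could be tightened roughly as shown 
here. This is an untested sketch: all values are illustrative assumptions, and 
the attempts and timeout parameters are assumed to be supported by the 
ocf:pacemaker:ping agent shipped with this pacemaker version.

```
# Untested sketch with illustrative values, not a verified fix.
# attempts/timeout are assumed to exist in this version of the ping agent.
primitive ping ocf:pacemaker:ping \
        params host_list="10.212.4.242" dampen="1s" attempts="2" timeout="1" \
        op monitor interval="1s" timeout="5s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
```

This would only shorten the time until the pingd attribute changes (roughly 
monitor interval plus ping attempts plus dampen). It does not address the node 
failure case, nor the time spent stopping and starting the resources.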

Best regards
Frederik Schüler

-- 
five times nine                              keep your business safe.
Inhaber: Frederik Schueler              Kirschgarten 15 21031 Hamburg
Tel: 040 219 84 844                             Mobil: 0170 298 28 47
Web: http://fivetimesnine.de/                    USt ID: DE-254646986
-------------- next part --------------
node $id="80e49a8c-48f9-4b83-98ed-247c3379c637" rollenserver1
node $id="d074ae53-cf19-4914-b93e-5ea478674856" rollenserver2
primitive IP ocf:heartbeat:IPaddr2 \
        params ip="10.212.4.250" nic="eth0" cidr_netmask="24" \
        op monitor interval="10s" timeout="20s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive drbd ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="120s" timeout="60s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
primitive fs ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/var/lib/postgresql/8.1/ha" fstype="ext3" options="noatime"
primitive pgsql ocf:heartbeat:pgsql \
        params pgctl="/usr/lib/postgresql/8.1/bin/pg_ctl" psql="/usr/lib/postgresql/8.1/bin/psql" pgdata="/var/lib/postgresql/8.1/ha" pgport="5433" pgdb="postgres" start_opt="-c config_file=/etc/postgresql/8.1/ha/postgresql.conf" logfile="/var/log/postgresql/postgresql-8.1-ha.log" \
        op monitor interval="120s" timeout="60s" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s"
primitive ping ocf:pacemaker:ping \
        params host_list="10.212.4.242" dampen="2s" \
        op monitor interval="3s" timeout="5s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s"
primitive rollenserver lsb:Rollenserver
primitive smsrelay lsb:SMSRelay
group Rollenserver IP rollenserver fs pgsql smsrelay
ms ms-drbd drbd \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
clone pingclone ping \
        meta globally-unique="false" target-role="Started"
location ms-drbd_master_on_connected_node ms-drbd \
        rule $id="ms-drbd_master_on_connected_node-rule" $role="master" -2000: not_defined pingd or pingd lte 0
colocation rollenserver_on_drbd inf: Rollenserver ms-drbd:Master
order rollenserver_after_drbd inf: ms-drbd:promote Rollenserver:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
        cluster-infrastructure="Heartbeat" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        default-resource-stickiness="100" \
        last-lrm-refresh="1297176953" \
        dc-deadtime="60s" \
        symmetric-cluster="true"