[Pacemaker] designing a load balancer - request for comments

Mon Feb 14 08:37:41 EST 2011

On 11.02.2011 16:13, Raoul Bhatia [IPAX] wrote:
> On 02/11/2011 03:07 PM, Klaus Darilion wrote:
...
>> Or, how should pacemaker behave if Kamailio on the active node crashes.
>> Shall it just restart Kamailio or shall it migrate the IP address to the
>> other node and then try to restart Kamailio on the inactive node?
> 
> pacemaker will not endlessly try to restart the configured resources:
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-failure-migration.html
> 
> pacemaker can be configured to restart a resource e.g. for a couple of
> times and if this does not work, it will migrate to another host.
> (you can also configure pacemaker to migrate upon the first failure)

Pacemaker does not react the way I would expect. My config is:

primitive failover-ip ocf:heartbeat:IPaddr \
        params ip="83.136.32.161" \
        op monitor interval="3s"
primitive kamailio lsb:kamailio \
        meta migration-threshold="2" failure-timeout="60" \
        op monitor interval="15" timeout="15"
clone cloneKamailio kamailio
colocation colo_ip_with_kamailio inf: failover-ip cloneKamailio
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
        resource-stickiness="5"
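
To make the expectation behind migration-threshold="2" and
failure-timeout="60" concrete, here is a minimal Python sketch of the
behaviour I would expect from these two meta attributes. It is a toy model
only, not Pacemaker's implementation; the class and method names
(FailureTracker, record_failure, allowed_on_node) are made up for
illustration, and time is passed in as plain seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureTracker:
    """Toy model of the expected behaviour; NOT Pacemaker's implementation."""
    migration_threshold: int = 2        # migration-threshold="2"
    failure_timeout: int = 60           # failure-timeout="60", in seconds
    fail_count: int = 0
    last_failure: Optional[float] = None

    def expire(self, now: float) -> None:
        # Expected: forget old failures once failure_timeout has elapsed
        # since the most recent failure.
        if (self.last_failure is not None
                and now - self.last_failure >= self.failure_timeout):
            self.fail_count = 0
            self.last_failure = None

    def record_failure(self, now: float) -> str:
        # Expected: restart locally until the threshold is reached,
        # then move away from this node.
        self.expire(now)
        self.fail_count += 1
        self.last_failure = now
        if self.fail_count >= self.migration_threshold:
            return "migrate"
        return "restart"

    def allowed_on_node(self, now: float) -> bool:
        # Expected: the node becomes eligible again after the timeout.
        self.expire(now)
        return self.fail_count < self.migration_threshold
```

In this model, two failures 30 seconds apart give "restart" and then
"migrate", and the node is allowed again 60 seconds after the second
failure. Two failures more than 60 seconds apart would each give "restart".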

Initially the status is:

Operations:
* Node server1:
   kamailio:0: migration-threshold=2
    + (4) start: rc=0 (ok)
    + (5) monitor: interval=15000ms rc=0 (ok)
* Node server2:
   kamailio:1: migration-threshold=2
    + (4) start: rc=0 (ok)
    + (5) monitor: interval=15000ms rc=0 (ok)
   failover-ip: migration-threshold=1000000
    + (6) start: rc=0 (ok)
    + (7) monitor: interval=3000ms rc=0 (ok)

Then I stop Kamailio manually on server2. After a few seconds Pacemaker
detects that Kamailio is not running and restarts it:

Operations:
* Node server1:
   kamailio:0: migration-threshold=2
    + (4) start: rc=0 (ok)
    + (5) monitor: interval=15000ms rc=0 (ok)
* Node server2:
   kamailio:1: migration-threshold=2 fail-count=1 last-failure='Mon Feb 14 13:08:52 2011'
    + (8) stop: rc=0 (ok)
    + (9) start: rc=0 (ok)
    + (10) monitor: interval=15000ms rc=0 (ok)
   failover-ip: migration-threshold=1000000
    + (6) start: rc=0 (ok)
    + (7) monitor: interval=3000ms rc=0 (ok)

Then I wait a few minutes, but the fail-count is still 1, although I
would expect failure-timeout="60" to clear it.

Then I stop Kamailio again. Pacemaker detects that Kamailio is not
running, increments the fail-count and migrates the IP address to the
other server. (Kamailio is not restarted on server2.)

Operations:
* Node server1:
   kamailio:0: migration-threshold=2
    + (4) start: rc=0 (ok)
    + (5) monitor: interval=15000ms rc=0 (ok)
   failover-ip: migration-threshold=1000000
    + (6) start: rc=0 (ok)
    + (7) monitor: interval=3000ms rc=0 (ok)
* Node server2:
   kamailio:1: migration-threshold=2 fail-count=2 last-failure='Mon Feb 14 13:30:23 2011'
    + (9) start: rc=0 (ok)
    + (10) monitor: interval=15000ms rc=7 (not running)
    + (12) stop: rc=0 (ok)
   failover-ip: migration-threshold=1000000
    + (6) start: rc=0 (ok)
    + (7) monitor: interval=3000ms rc=0 (ok)
    + (11) stop: rc=0 (ok)

Failed actions:
    kamailio:1_monitor_15000 (node=server2, call=10, rc=7, status=complete): not running

Then I wait a few minutes, but the fail-count is still 2 and Kamailio is
still not restarted. From the documentation I would expect the
fail-count to be reset after failure-timeout="60" and Kamailio to be
started again on server2.
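
As a sanity check on the timing, the two last-failure timestamps reported
for kamailio:1 can be compared against the configured timeout. This is
only a small Python sketch using the values copied from the status output
above; the variable names are my own.

```python
from datetime import datetime

FAILURE_TIMEOUT_S = 60  # failure-timeout="60" from the kamailio primitive

# last-failure values copied from the two status outputs for kamailio:1
first_failure = datetime(2011, 2, 14, 13, 8, 52)
second_failure = datetime(2011, 2, 14, 13, 30, 23)

elapsed_s = int((second_failure - first_failure).total_seconds())
print(elapsed_s)                      # seconds between the two failures
print(elapsed_s > FAILURE_TIMEOUT_S)  # did the timeout elapse in between?
```

The gap is 1291 seconds (21 minutes 31 seconds), far more than 60 seconds,
yet the second failure raised the fail-count from 1 to 2 instead of
starting again from 1.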

After 4 minutes Kamailio is started again on server2, but the fail-count is still 2:

Operations:
* Node server1:
   kamailio:0: migration-threshold=2
    + (4) start: rc=0 (ok)
    + (5) monitor: interval=15000ms rc=0 (ok)
   failover-ip: migration-threshold=1000000
    + (6) start: rc=0 (ok)
    + (7) monitor: interval=3000ms rc=0 (ok)
* Node server2:
   kamailio:1: migration-threshold=2 fail-count=2 last-failure='Mon Feb 14 13:30:23 2011'
    + (12) stop: rc=0 (ok)
    + (13) start: rc=0 (ok)
    + (14) monitor: interval=15000ms rc=0 (ok)
   failover-ip: migration-threshold=1000000
    + (6) start: rc=0 (ok)
    + (7) monitor: interval=3000ms rc=0 (ok)
    + (11) stop: rc=0 (ok)

So, what am I doing wrong? I would expect the fail-count to be reset
after 60s.

Thanks
Klaus