[Pacemaker] When the ex-live server comes back online, it tries to failback causing a failure and restart in services

Fri Jan 17 00:33:56 EST 2014

Hi,

I have 2 servers set up with Postgres, and /dev/drbd1 is mounted at /var/lib/pgsql. I also have Pacemaker set up to fail over back and forth between the 2 nodes. It works really well for the most part.

I am having one problem, and it happens on all 4 of my clusters. If the "web_services" resource group is running on database-2.hehe.org and I do a hard reset on that node, it fails over fine, and within a few seconds the DB is running on database-1.hehe.org. When I turn database-2 back on, everything is fine: it comes back online with no issue and everything continues to run normally on database-1. crm_mon shows no errors at all; the node simply goes into online status.

HOWEVER, if I do a hard shutdown on database-1 (or any of my primary nodes: ldap-1, idp-1, acc-1), it fails over to database-2 just fine. But when database-1 comes back into online status, it seems like Pacemaker tries to move the resources back to database-1, fails, and then the services get restarted on database-2 because of the attempted move back.

Why do all of my 1st nodes try to take the resources back when they come back online, while none of the 2nd nodes do? Is there any way to prevent this? Can Pacemaker not check whether the resources in the cluster are already running and, if so, just become an available node for the next time?

I tried setting resource stickiness to infinity. I have also tried starting the corosync/pacemaker service with the node in standby beforehand, and it's always the same thing: once node-1 is online, all the services on node-2 get interrupted by the attempt to fail back, which fails (probably just because drbd is already in use on the other end).
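
For reference, a stickiness setting of this kind is usually written along the following lines in crm shell syntax. This is only a sketch, assuming the value is applied as a cluster-wide resource default through rsc_defaults; it is not part of the config shown below, and the same value could instead be set as a meta attribute on the web_services group.

```
# Assumed form of the stickiness setting (not taken from the live config):
# make every resource prefer to stay on the node where it is currently running.
rsc_defaults resource-stickiness="INFINITY"
```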

Here is my config:

node database-1.hehe.org \
        attributes standby="off"
node database-2.hehe.org \
        attributes standby="off"
primitive drbd_data ocf:linbit:drbd \
        params drbd_resource="res1" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
primitive fs_data ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/var/lib/pgsql" fstype="ext4"
primitive httpd lsb:postgresql
primitive ip_httpd ocf:heartbeat:IPaddr2 \
        params ip="10.199.0.11"
group web_services fs_data ip_httpd httpd
ms ms_drbd_data drbd_data \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation web_services_on_drbd inf: httpd ms_drbd_data:Master
order web_services_after_drbd inf: ms_drbd_data:promote web_services:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6_5.1-368c726" \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1389926961"

Thanks