[Pacemaker] no failover if fencing device is unreachable (i.e. power loss)

Mon Aug 18 13:53:46 EDT 2014

On 18/08/14 01:50 PM, Felix Schrage wrote:
> Hi,
>
> I'm building a two-node cluster running XenServer, Pacemaker and DRBD. There's a problem when testing failover by powering off the currently active node.
> When using the fence_xenapi agent, the resource ClusterIP is not moved to the 2nd node until the first node has been successfully shut down.
> However, because the XenAPI is unreachable when the machine is powered off, the 2nd node keeps trying to shut down the first node and the resource is never moved.
>
> To check whether it's an error in the fence_xenapi agent I tried fence_ipmilan, which works fine as long as the IPMI interface is reachable. When pulling the power cords from the machine,
> however, the behavior is the same as with the fence_xenapi agent.
> Am I missing an option which should be set? A timeout or a retry counter?

This is the expected behaviour. Failing to connect to the fence 
device (or failing to confirm the "off" action) cannot be treated as a 
successful fence. Without a successful fence, it cannot be assumed that 
the peer is gone. Assuming so would risk a split-brain, so the 
cluster's only sane and safe option is to block.

This is why we always use switched PDUs as a backup fence method. A 
PDU stays powered and reachable when the node (and its BMC) loses 
power, so the "off" can still be confirmed. You can configure this 
with STONITH levels:

http://clusterlabs.org/wiki/STONITH_Levels
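
As a rough sketch only, this is roughly what a two-level setup could 
look like for ftp-test01, with IPMI as the first level and a switched 
PDU as the second. I'm assuming an APC PDU driven by fence_apc_snmp; 
the device name pdu-fence-test-xen-01, the address pdu-01.example.lan 
and outlet 1 are made up, and the ftp_ha_cluster CIB file is assumed to 
exist already (from "pcs cluster cib"). The run() wrapper is just a 
helper for this example so the commands are printed rather than 
executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only. The PDU device name, the PDU address and the outlet
# number are hypothetical. DRY_RUN=1 (the default) prints the commands
# instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

configure_fence_levels() {
    # Backup device: the switched PDU outlet feeding test-xen-01 (assumed).
    run pcs -f ftp_ha_cluster stonith create pdu-fence-test-xen-01 \
        fence_apc_snmp pcmk_host_list="ftp-test01" \
        ipaddr="pdu-01.example.lan" port="1" op monitor interval=60s

    # Level 1 is tried first; level 2 is used only if level 1 fails.
    run pcs -f ftp_ha_cluster stonith level add 1 ftp-test01 ipmi-fence-test-xen-01
    run pcs -f ftp_ha_cluster stonith level add 2 ftp-test01 pdu-fence-test-xen-01
}

configure_fence_levels
```

The same would be repeated for ftp-test02 with its own devices. If the 
hosts have redundant power supplies, every outlet feeding the host has 
to be switched off in the same level, otherwise the node keeps running.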

> Here's how I set up the cluster (fence_xenapi) using pcs:
>
> pcs cluster cib ftp_ha_cluster
> pcs -f ftp_ha_cluster resource create ClusterIP IPaddr2 ip=172.20.150.150 cidr_netmask=32 op monitor interval=20s
> pcs -f ftp_ha_cluster constraint location ClusterIP prefers ftp-test01=50
> pcs -f ftp_ha_cluster stonith create xenvm-fence-ftp1 fence_xenapi pcmk_host_list="ftp-test01" action="off" session_url="https://test-xen-01" port="ftp-test01" login="root" passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster stonith create xenvm-fence-ftp2 fence_xenapi pcmk_host_list="ftp-test02" action="off" session_url="https://test-xen-02" port="ftp-test02" login="root" passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster constraint location xenvm-fence-ftp1 prefers ftp-test01=-INFINITY
> pcs -f ftp_ha_cluster constraint location xenvm-fence-ftp2 prefers ftp-test02=-INFINITY
> pcs -f ftp_ha_cluster property set stonith-enabled=true
> pcs -f ftp_ha_cluster property set stonith-action=off
> pcs -f ftp_ha_cluster property set stonith-timeout=40s
> pcs -f ftp_ha_cluster property set no-quorum-policy=ignore
> pcs -f ftp_ha_cluster resource create Ping ocf:pacemaker:ping dampen="5s" multiplier="100" host_list="172.20.150.1 172.20.150.151 172.20.150.152" attempts="3" op monitor interval=20s
> pcs -f ftp_ha_cluster resource clone Ping
> pcs -f ftp_ha_cluster constraint location ClusterIP rule score=-INF not_defined pingd or pingd lte 0
> pcs -f ftp_ha_cluster constraint location ClusterIP rule score=pingd defined pingd
> pcs cluster cib-push ftp_ha_cluster
>
> For testing with fence_ipmilan I replaced the corresponding lines with the following:
>
> pcs -f ftp_ha_cluster stonith create ipmi-fence-test-xen-01 fence_ipmilan pcmk_host_list="ftp-test01" action="off" ipaddr="test-xen-01-bmc.mercateo.lan" auth="password" login="admin" passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster stonith create ipmi-fence-test-xen-02 fence_ipmilan pcmk_host_list="ftp-test02" action="off" ipaddr="test-xen-02-bmc.mercateo.lan" auth="password" login="admin" passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster constraint location ipmi-fence-test-xen-01 prefers ftp-test01=-INFINITY
> pcs -f ftp_ha_cluster constraint location ipmi-fence-test-xen-02 prefers ftp-test02=-INFINITY
>
>
> the content of /etc/corosync/corosync.conf:
>
> compatibility: whitetank
>
> totem {
> 	version: 2
> 	secauth: off
> 	threads: 0
> 	interface {
> 		ringnumber: 0
> 		bindnetaddr: 192.168.199.0
> 		mcastaddr: 226.94.1.1
> 		mcastport: 5405
> 		ttl: 1
> 	}
> }
>
> logging {
> 	fileline: off
> 	to_stderr: no
> 	to_logfile: yes
> 	to_syslog: no
> 	logfile: /var/log/cluster/corosync.log
> 	debug: off
> 	timestamp: on
> 	logger_subsys {
> 		subsys: AMF
> 		debug: off
> 	}
> }
>
> amf {
> 	mode: disabled
> }
>
> service {
> 	ver:	1
> 	name:	pacemaker
> }
>
> Any idea what could be missing/wrong?
>
> Kind regards,
>
> Felix

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?