[Pacemaker] no failover if fencing device is unreachable (i.e. power loss)

Mon Aug 18 14:01:33 EDT 2014

Thanks for the quick answer. I'll have a look at that.
Is there a way to manually force a failover when I can be sure the other machine is down?

Kind regards

Felix

-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca]
Sent: Monday, 18 August 2014 19:57
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] no failover if fencing device is unreachable (i.e. power loss)

On 18/08/14 01:50 PM, Felix Schrage wrote:
> Hi,
>
> I'm building a two-node cluster running XenServer, Pacemaker and DRBD. There's a problem when testing failover by powering off the currently active node.
> When using the fence_xenapi agent, the resource ClusterIP is not moved to the 2nd node until the first node has been successfully shut down.
> However, because the XenAPI is unreachable when the machine is powered off, the 2nd node keeps trying to shut down the first node and the resource is never moved.
>
> To check whether it's an error in the fence_xenapi agent, I tried fence_ipmilan, which works fine as long as the IPMI interface is reachable. When pulling the power cords from the machine, however, the behavior is the same as with the fence_xenapi agent.
> Am I missing an option that should be set? A timeout or a retry counter?

This is the expected behaviour. Being unable to connect to the fence device (or failing to confirm the "off" action) cannot be treated as a successful fence. Without a successful fence, it cannot be assumed that the peer is gone. To assume so would risk a split-brain, so the cluster's only sane and safe option is to block.
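
To make that rule concrete, here is a toy shell model of the decision. It is only a sketch and not Pacemaker's code: fence_node, device_unreachable and device_confirms_off are hypothetical names, and each "device" is simply a command that exits 0 only when it has confirmed the peer is off.

```shell
# Toy model of the fencing decision; hypothetical names, not Pacemaker code.
# Each "device" is a command that exits 0 only when power-off is confirmed.
fence_node() {
    for device in "$@"; do
        if "$device"; then
            echo "fenced"      # a device confirmed "off": recovery may proceed
            return 0
        fi
    done
    echo "blocked"             # nothing confirmed "off": the peer may still be alive
    return 1
}

device_unreachable()  { return 1; }   # e.g. a BMC that lost power with the node
device_confirms_off() { return 0; }   # e.g. a device that is still reachable

fence_node device_unreachable || true               # prints "blocked"
fence_node device_unreachable device_confirms_off   # prints "fenced"
```

An unreachable device is never counted as success; the only way out of "blocked" is another device that can actually confirm the node is off.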

This is why we always use switched PDUs as a backup fence method. You can see how to configure this with STONITH levels:

http://clusterlabs.org/wiki/STONITH_Levels
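
As a rough sketch of what that could look like with pcs for one node, assuming your existing IPMI device stays as the first level and an APC-style switched PDU is added as the second: the PDU device name, its address and the outlet number below are placeholders I made up, and the fence_apc_snmp parameter names may differ between fence-agents versions, so check the agent's metadata before using any of it.

```shell
# Hypothetical PDU device: name, address and outlet are placeholders.
pcs stonith create pdu-fence-test-xen-01 fence_apc_snmp \
    pcmk_host_list="ftp-test01" ipaddr="pdu01.example.com" port="1"

# Level 1: try IPMI first. Level 2: fall back to the PDU if IPMI fails.
pcs stonith level add 1 ftp-test01 ipmi-fence-test-xen-01
pcs stonith level add 2 ftp-test01 pdu-fence-test-xen-01
```

The second node would need the same pair of levels with its own devices. If the servers have redundant power supplies, every outlet feeding the node has to be switched off at the PDU level for the fence to count.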

> Here's how I set up the cluster (fence_xenapi) using pcs:
>
> pcs cluster cib ftp_ha_cluster
> pcs -f ftp_ha_cluster resource create ClusterIP IPaddr2 \
>     ip=172.20.150.150 cidr_netmask=32 op monitor interval=20s
> pcs -f ftp_ha_cluster constraint location ClusterIP prefers ftp-test01=50
> pcs -f ftp_ha_cluster stonith create xenvm-fence-ftp1 fence_xenapi \
>     pcmk_host_list="ftp-test01" action="off" \
>     session_url="https://test-xen-01" port="ftp-test01" login="root" \
>     passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster stonith create xenvm-fence-ftp2 fence_xenapi \
>     pcmk_host_list="ftp-test02" action="off" \
>     session_url="https://test-xen-02" port="ftp-test02" login="root" \
>     passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster constraint location xenvm-fence-ftp1 prefers ftp-test01=-INFINITY
> pcs -f ftp_ha_cluster constraint location xenvm-fence-ftp2 prefers ftp-test02=-INFINITY
> pcs -f ftp_ha_cluster property set stonith-enabled=true
> pcs -f ftp_ha_cluster property set stonith-action=off
> pcs -f ftp_ha_cluster property set stonith-timeout=40s
> pcs -f ftp_ha_cluster property set no-quorum-policy=ignore
> pcs -f ftp_ha_cluster resource create Ping ocf:pacemaker:ping \
>     dampen="5s" multiplier="100" \
>     host_list="172.20.150.1 172.20.150.151 172.20.150.152" attempts="3" \
>     op monitor interval=20s
> pcs -f ftp_ha_cluster resource clone Ping
> pcs -f ftp_ha_cluster constraint location ClusterIP rule score=-INF not_defined pingd or pingd lte 0
> pcs -f ftp_ha_cluster constraint location ClusterIP rule score=pingd defined pingd
> pcs cluster cib-push ftp_ha_cluster
>
> For testing with fence_ipmilan, I replaced the corresponding lines with the following:
>
> pcs -f ftp_ha_cluster stonith create ipmi-fence-test-xen-01 fence_ipmilan \
>     pcmk_host_list="ftp-test01" action="off" \
>     ipaddr="test-xen-01-bmc.mercateo.lan" auth="password" login="admin" \
>     passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster stonith create ipmi-fence-test-xen-02 fence_ipmilan \
>     pcmk_host_list="ftp-test02" action="off" \
>     ipaddr="test-xen-02-bmc.mercateo.lan" auth="password" login="admin" \
>     passwd="****" delay=15 op monitor interval=40s
> pcs -f ftp_ha_cluster constraint location ipmi-fence-test-xen-01 prefers ftp-test01=-INFINITY
> pcs -f ftp_ha_cluster constraint location ipmi-fence-test-xen-02 prefers ftp-test02=-INFINITY
>
>
> The content of /etc/corosync/corosync.conf:
>
> compatibility: whitetank
>
> totem {
> 	version: 2
> 	secauth: off
> 	threads: 0
> 	interface {
> 		ringnumber: 0
> 		bindnetaddr: 192.168.199.0
> 		mcastaddr: 226.94.1.1
> 		mcastport: 5405
> 		ttl: 1
> 	}
> }
>
> logging {
> 	fileline: off
> 	to_stderr: no
> 	to_logfile: yes
> 	to_syslog: no
> 	logfile: /var/log/cluster/corosync.log
> 	debug: off
> 	timestamp: on
> 	logger_subsys {
> 		subsys: AMF
> 		debug: off
> 	}
> }
>
> amf {
> 	mode: disabled
> }
>
> service {
> 	ver:	1
> 	name:	pacemaker
> }
>
> Any idea what could be missing/wrong?
>
> Kind regards,
>
> Felix
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?