[ClusterLabs] Antw: Re: HA domain controller fences newly joined node after fence_ipmilan delay even if transition was aborted.

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed Dec 19 02:30:35 EST 2018


>>> Chris Walker <cwalker at cray.com> wrote on 18.12.2018 at 17:13 in message
<ad34707f-ecd7-b58b-f31b-c5ca0f53b85a at cray.com>:

[...]
> 2.  As Ken mentioned, synchronize the starting of Corosync and Pacemaker.  I 
> did this with a simple ExecStartPre systemd script:
> 
> [root at bug0 ~]# cat /etc/systemd/system/corosync.service.d/ha_wait.conf
> [Service]
> ExecStartPre=/sbin/ha_wait.sh
> TimeoutStartSec=11min
> [root at bug0 ~]#
> 
> where ha_wait.sh has something like:
> 
> #!/bin/bash
> 
> timeout=600
> 
> peer=<hostname of HA peer>
> 
> echo "Waiting for ${peer}"
> peerup() {
>   systemctl -H "${peer}" show -p ActiveState corosync.service 2> /dev/null | \
>     grep -Eq "=active|=reloading|=failed|=activating|=deactivating" && return 0
>   return 1
> }
> 
> now=${SECONDS}
> while ! peerup && [ $((SECONDS-now)) -lt ${timeout} ]; do
>   echo -n .
>   sleep 5
> done
> 
> peerup && echo "${peer} is up, starting HA" || echo "${peer} not up after ${timeout}s, starting HA alone"
> 
> 
> This will cause corosync startup to block for up to 10 minutes waiting for 
> the partner node to come up, after which both nodes will start 
> corosync/pacemaker close together in time.  If one node never comes up, the 
> surviving node will wait 10 minutes before starting, after which the absent 
> node will be fenced (startup fencing and subsequent resource startup will 
> only occur if no-quorum-policy is set to ignore)

Hi!

I also miss such an option, which I know from HP-UX ServiceGuard: there was a delay to wait "for the cluster to form", meaning that if all nodes are down, there is no cluster, and a new cluster "has to form". The first node to boot would not simply become a one-node cluster; it would wait a configurable time for other nodes to join (come up). The cluster would form either when all configured cluster nodes had come up or when the configured time had elapsed. The advantage is that unneeded resource movements are avoided when other nodes come up shortly after a new cluster has formed...

That makes sense to me.
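
The formation rule described above (proceed once every configured node answers, or once the configured time has elapsed) could be sketched as a generalization of the quoted ha_wait.sh to several peers. This is only a rough sketch under assumptions: the names wait_for_cluster, probe_peer and PROBE_CMD are hypothetical, and the probe simply reuses the "systemctl -H" check from the script above.

```shell
#!/bin/bash
# Sketch only: wait_for_cluster, probe_peer and PROBE_CMD are hypothetical names.

# Return 0 if the peer answers with any known state for corosync.service
# (same check as in the quoted ha_wait.sh; needs ssh access to the peer).
probe_peer() {
  systemctl -H "$1" show -p ActiveState corosync.service 2>/dev/null |
    grep -Eq '=(active|reloading|failed|activating|deactivating)'
}

# wait_for_cluster TIMEOUT INTERVAL PEER...
# Return 0 once all peers answer, 1 if TIMEOUT seconds elapse first.
wait_for_cluster() {
  local timeout=$1 interval=$2
  shift 2
  local probe=${PROBE_CMD:-probe_peer} start peer missing
  start=$(date +%s)
  while :; do
    missing=0
    for peer in "$@"; do
      "$probe" "$peer" || missing=$((missing + 1))
    done
    [ "$missing" -eq 0 ] && return 0
    [ $(( $(date +%s) - start )) -ge "$timeout" ] && return 1
    sleep "$interval"
  done
}
```

With placeholder hostnames, the ExecStartPre script could then call something like "wait_for_cluster 600 5 nodeB nodeC" and start corosync regardless of the result, as the original script does.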

(ServiceGuard also had an option I miss in pacemaker: I could configure a network timeout below which outages were ignored. So when I quickly replugged a cable (which would cause an "interface down" for at least five seconds), the cluster would NOT trigger any corrective actions. We had that timeout set to 30 seconds or so, as most resource operations would take much longer to complete...)
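
To illustrate the idea of such a grace period (this is not a pacemaker or ServiceGuard feature, just a sketch under assumptions): a monitor would only report a link failure if the interface stays down longer than the configured time. The names outage_exceeds_grace, link_up and LINK_CHECK_CMD are hypothetical; the default check reads the Linux sysfs carrier flag.

```shell
#!/bin/bash
# Sketch only: outage_exceeds_grace, link_up and LINK_CHECK_CMD are hypothetical.

# Return 0 if the interface currently reports carrier (Linux sysfs).
link_up() {
  [ "$(cat "/sys/class/net/$1/carrier" 2>/dev/null)" = 1 ]
}

# outage_exceeds_grace IFACE GRACE INTERVAL
# Return 0 if the link stays down for at least GRACE seconds (act on it),
# 1 if the link is up or recovers within the grace period (ignore it).
outage_exceeds_grace() {
  local iface=$1 grace=$2 interval=$3
  local check=${LINK_CHECK_CMD:-link_up} start
  start=$(date +%s)
  while ! "$check" "$iface"; do
    [ $(( $(date +%s) - start )) -ge "$grace" ] && return 0
    sleep "$interval"
  done
  return 1
}
```

With a 30-second grace period, a call like "outage_exceeds_grace eth0 30 1" would return 1 for a five-second cable replug, so no corrective action would be started.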

Regards,
Ulrich

> 
> HTH,
> 
> Chris
[...]




