[ClusterLabs] Pacemaker startup retries

Wed Sep 5 12:13:41 EDT 2018

On Wed, 2018-09-05 at 17:21 +0200, Cesar Hernandez wrote:
> > 
> > P.S. If the issue is just a matter of timing when you're starting both
> > nodes, you can start corosync on both nodes first, then start pacemaker
> > on both nodes. That way pacemaker on each node will immediately see the
> > other node's presence.
> > -- 
> 
> Well, rebooting a server takes approximately 2 minutes.
> I think I'm going to keep the same workaround I have on other servers:
> 
> - set crm stonith-timeout=300s
> - have a "sleep 180" in the fencing script, so the fencing will always
>   take 3 minutes
> 
> So when crm fences a node on startup, the fencing script will return
> after 3 minutes. By that time the other node should be up, so the
> fencing won't be retried.
> 
> What do you think about this workaround?
> 
> 
> The other solution would be updating pacemaker, but I have tested this
> 1.1.14 on many servers, and I don't want to take the risk of updating
> to 1.1.15 and (maybe) having some other new issues...
> 
> Thanks a lot!
> Cesar
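
The timing that the quoted workaround relies on can be written down as a
quick sanity check. This is only a sketch of the arithmetic, not anything
pacemaker itself runs; the function name and parameters are hypothetical,
and the values are the ones quoted above (about 120s to reboot, a 180s
sleep in the fencing script, stonith-timeout=300s).

```python
def workaround_timing_ok(reboot_s: int, fence_delay_s: int,
                         stonith_timeout_s: int) -> bool:
    """Check the two conditions the workaround depends on.

    1. The peer finishes rebooting before the fencing script returns
       (reboot_s < fence_delay_s).
    2. The fencing script returns before stonith-timeout expires
       (fence_delay_s < stonith_timeout_s).
    """
    return reboot_s < fence_delay_s < stonith_timeout_s


# Values from the message above: ~2 minute reboot, "sleep 180", 300s timeout.
print(workaround_timing_ok(reboot_s=120, fence_delay_s=180,
                           stonith_timeout_s=300))
```

If a reboot ever takes longer than the sleep, the first condition no longer
holds and the assumption behind the workaround breaks.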

If you build from source, you can apply the patch that fixes the issue
to the 1.1.14 code base:

https://github.com/ClusterLabs/pacemaker/commit/98457d1635db1222f93599b6021e662e766ce62d
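
One possible way to fetch and apply that commit, as a rough sketch: GitHub
serves a commit as a patch file when ".patch" is appended to the commit URL,
and GNU patch can apply it from the top of the source tree. The helper names
and the source directory are hypothetical, and whether the patch applies
cleanly depends on your tree, so the default below is a dry run only.

```python
import subprocess
import urllib.request

COMMIT_URL = ("https://github.com/ClusterLabs/pacemaker/commit/"
              "98457d1635db1222f93599b6021e662e766ce62d")


def patch_url(commit_url: str) -> str:
    """Return the URL GitHub serves the commit from in patch format."""
    return commit_url.rstrip("/") + ".patch"


def apply_commit(source_dir: str, commit_url: str = COMMIT_URL,
                 check_only: bool = True) -> None:
    """Download the commit as a patch and apply it inside source_dir.

    With check_only=True, GNU patch runs with --dry-run and changes nothing.
    """
    with urllib.request.urlopen(patch_url(commit_url)) as resp:
        patch = resp.read()
    cmd = ["patch", "-p1"]
    if check_only:
        cmd.append("--dry-run")
    # Raises CalledProcessError if the patch does not apply.
    subprocess.run(cmd, cwd=source_dir, input=patch, check=True)
```

For example, apply_commit("/path/to/pacemaker-1.1.14") would test the patch
against an unpacked source tree (the path is a placeholder); passing
check_only=False applies it for real before rebuilding.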
-- 
Ken Gaillot <kgaillot at redhat.com>