[ClusterLabs] Pacemaker startup retries

Fri Aug 31 13:54:40 UTC 2018

On Fri, 2018-08-31 at 08:37 +0200, Cesar Hernandez wrote:
> Hi
> 
> > 
> > 
> > Do you mean you have a custom fencing agent configured? If so,
> > check
> > the return value of each attempt. Pacemaker should request fencing
> > only
> > once as long as it succeeds (returns 0), but if the agent fails
> > (returns nonzero or times out), it will retry, even if the reboot
> > worked in reality.
> > 
> 
> Yes, custom fencing agent, and it always returns 0 
> > 
> > 
> > FYI, corosync 2 has a "two_node" setting that includes
> > "wait_for_all"
> > -- with that, you don't need to ignore quorum in pacemaker, and the
> > cluster won't start until both nodes have seen each other at least
> > once.
> 
> Well I'm ok with the quorum behaviour but I want to know why it
> reboots 3 times on startup.
> When both nodes are up and running, and if one node stops responding,
> the other node fences it only 1 time, not 3
> > 
> > 
> 
> Do you know why it happens?
> 
> Thanks
> Cesar

Check the pacemaker logs on both nodes around the time it happens.

One of the nodes will be the DC, and will have "pengine:" logs with
"saving inputs".

The first thing I'd look for is who requested fencing. The DC will have
stonith logs with "Client ... wants to fence ...". The client will
either be crmd (i.e. the cluster itself) or some external program.
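
As a rough illustration of that search, here is a minimal Python sketch. It
assumes the log is plain text and that the message wording matches the
"Client ... wants to fence ..." phrase quoted above. The regular expression,
the function names, and the idea that the client name starts with "crmd" are
my assumptions for illustration, not a documented log format, so adjust them
to what your logs actually contain.

```python
import re
import sys

# Assumed wording, based on the "Client ... wants to fence ..." phrase.
# The exact layout of the line may differ between Pacemaker versions.
FENCE_REQUEST = re.compile(r"Client\s+(?P<client>\S+)\s+wants to fence")


def fencing_requests(lines):
    """Yield (line_number, client, line) for each fencing request found."""
    for number, line in enumerate(lines, start=1):
        match = FENCE_REQUEST.search(line)
        if match:
            yield number, match.group("client"), line.rstrip("\n")


def requested_by_cluster(client):
    """True if the client name looks like crmd, i.e. the cluster itself."""
    return client.startswith("crmd")


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is supplied by the caller; no default location is assumed.
    with open(sys.argv[1], errors="replace") as log:
        for number, client, line in fencing_requests(log):
            origin = "cluster" if requested_by_cluster(client) else "external"
            print(f"{number}: [{origin}] {line}")
```

Run it against the DC's log file (passed as the only argument). Any line
tagged "external" points to something other than the cluster asking for the
fencing.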

If it's the cluster, I'd look at the "pengine:" logs on the DC before
that, to see if there are any hints (node unclean, etc.). Then keep
going backward until the ultimate cause is found.
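
To make the "look backward" step less tedious, something like the following
Python sketch could collect the "pengine:" lines that precede each fencing
request. The substrings it matches are taken from the phrases above; the
function name and the window size are hypothetical choices for illustration
only.

```python
from collections import deque


def pengine_context(lines, window=50):
    """Pair each fencing request with the 'pengine:' lines seen before it.

    `window` limits how many of the most recent 'pengine:' lines are kept
    (it counts pengine lines only, not every log line).
    """
    recent = deque(maxlen=window)
    results = []
    for line in lines:
        text = line.rstrip("\n")
        if "wants to fence" in text:
            # Snapshot the pengine lines collected so far for this request.
            results.append((text, list(recent)))
        elif "pengine:" in text:
            recent.append(text)
    return results
```

Feed it the lines of the DC's log and read each returned list from the end
backward, looking for hints such as a node being reported unclean.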
-- 
Ken Gaillot <kgaillot at redhat.com>