[ClusterLabs] Problem with stonith and starting services

Wed Jul 12 15:16:49 UTC 2017

> On 6 Jul 2017, at 17:34, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
> On 07/06/2017 10:27 AM, Cesar Hernandez wrote:
>> 
>>> 
>>> It looks like a bug when the fenced node rejoins quickly enough that it
>>> is a member again before its fencing confirmation has been sent. I know
>>> there have been plenty of clusters with nodes that quickly reboot and
>>> slow fencing devices, so that seems unlikely, but I don't see another
>>> explanation.
>>> 
>> 
>> Could it be caused by node 2 being rebooted and back up before the stonith script has finished?
> 
> That *shouldn't* cause any problems, but I'm not sure what's happening
> in this case.

So, this turned out to be the cause of the problem.
Before the two servers I have now, I had done three other cluster installations with a different internet hosting provider. With that provider, a machine took more than 2 minutes to reboot through the fencing script (slow boot process and a slow remote API).
So I added a "sleep 90" before the end of the script, and it always worked perfectly.

Now, with a different provider, I used the same script and only swapped the remote API for the new provider's API. Here a machine takes approx. 10 seconds to do a full reboot, and the API is also faster (just 2 or 3 seconds to respond).
So the machine was up again in less than 20 seconds.

I suppose the problem occurs when the rebooted node (node2, for example) rejoins while node1 is still waiting for the fencing script to finish (because of the sleep 90); node2 then gets confused and exits pacemaker.

I changed the sleep 90 to a sleep 5 and it hasn't happened again.
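
For anyone with a similar setup: instead of a fixed sleep, the script could poll the provider until the reset is confirmed and then return right away, so the wait adapts to how fast the provider is. The following is only a rough sketch in Python, not my actual script. `get_reset_status` is a hypothetical wrapper around whatever the provider's API offers, and the "done" value, the timeout and the interval are assumptions.

```python
import time


def wait_for_reset(get_reset_status, timeout=30.0, interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the provider confirms the reset, or give up after `timeout`.

    get_reset_status: hypothetical callable wrapping the provider API,
                      assumed to return "done" once the reset is confirmed.
    Returns True if the reset was confirmed in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        # Return as soon as the provider confirms, with no extra waiting.
        if get_reset_status() == "done":
            return True
        # Give up once the deadline has passed.
        if clock() >= deadline:
            return False
        sleep(interval)
```

A fencing script built around this would exit with 0 when the function returns True and with a non-zero code otherwise, so the confirmation is sent as early as possible and a failed reset is still reported.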

Thanks a lot to everyone for the help

Cheers
Cesar