[ClusterLabs] Problem with stonith and starting services

Klaus Wenninger kwenning at redhat.com
Fri Jul 14 06:57:31 UTC 2017


On 07/12/2017 05:16 PM, Cesar Hernandez wrote:
>
>> On 6 Jul 2017, at 17:34, Ken Gaillot <kgaillot at redhat.com> wrote:
>>
>> On 07/06/2017 10:27 AM, Cesar Hernandez wrote:
>>>> It looks like a bug when the fenced node rejoins quickly enough that it
>>>> is a member again before its fencing confirmation has been sent. I know
>>>> there have been plenty of clusters with nodes that quickly reboot and
>>>> slow fencing devices, so that seems unlikely, but I don't see another
>>>> explanation.
>>>>
>>> Could it be caused by node 2 rebooting and coming back up before the stonith script has finished?
>> That *shouldn't* cause any problems, but I'm not sure what's happening
>> in this case.
>
> So, this was the cause of the problem...
> Before the two servers I have now, I had set up three other clusters with a different internet hosting provider. With that provider, a machine took more than 2 minutes to reboot via the fencing script (slow boot process and a slow remote API).
> So I added a "sleep 90" before the end of the script and it always worked perfectly.
>
> Now, with a different provider, I used the same script, only swapping in that provider's API. Here a machine takes roughly 10 seconds to do a full reboot, and the API is also faster (just 2 or 3 seconds to respond).
> So the machine was up again in less than 20 seconds.
>
> I suppose the problem comes when the rebooted node (node2, for example) sees that node1 is still waiting for the fencing script to finish (due to the sleep 90); it just gets confused and pacemaker exits.
>
> I changed that sleep 90 to a sleep 5 and it hasn't happened again.
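If I read that right, the agent is basically doing something like the
following (a rough sketch only; the curl call is a placeholder for
whatever the provider's reboot API actually looks like, and the
argument handling depends on the agent class you use):

    #!/bin/sh
    # Rough sketch of a custom fence agent along the lines described
    # above -- not the actual script from this thread. The curl call is
    # a placeholder for the hosting provider's reboot API.
    action="$1"     # e.g. "reset"/"reboot"/"off", depending on agent class
    victim="$2"     # name of the node to be fenced

    case "$action" in
        reset|reboot|off)
            # Ask the provider to power-cycle the victim (placeholder URL).
            curl -s -X POST "https://api.example-provider.test/servers/$victim/reboot"
            # Do not report success straight away: give the victim time to
            # actually go down before the DC gets the fencing confirmation.
            sleep 90
            exit 0
            ;;
        *)
            exit 1
            ;;
    esac
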
I guess pacemaker should be able to cope with a situation like that.

With sbd fencing (e.g. fence_sbd) you actually have quite a similar
case: the fence agent writes the poison pill into the disk slot of the
node to be fenced, and the victim node usually reads it within a
second. But because the guaranteed response times of the shared disk
in an enterprise environment can be very long, the fence agent still
waits for 60s or so to be really sure that the other side has
swallowed the pill.
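
The timing pattern is roughly the following (just an illustration of
the shape of it, not the actual fence_sbd code; the device path and
the 60s are example values):

    # Illustration of the poison-pill timing only -- not the actual
    # fence_sbd implementation. Device path and timeout are examples.
    SBD_DEVICE=/dev/disk/by-id/example-shared-disk
    VICTIM=node2

    # Write the poison pill into the victim's slot on the shared disk.
    sbd -d "$SBD_DEVICE" message "$VICTIM" reset

    # The victim usually reads the pill within a second, but wait out a
    # conservative msgwait before telling the cluster the fencing is done.
    sleep 60
    exit 0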

So if this really is the cause, it would probably be worth
finding out what is actually happening.

Regards,
Klaus

>
> Thanks a lot to everyone for the help
>
> Cheers
> Cesar
>
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Openstack Infrastructure

Red Hat

kwenning at redhat.com