[ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

Wed Nov 22 14:45:19 EST 2017

On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
> SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
> VM on VSphere using shared VMDK as SBD. During basic tests by killing
> corosync and forcing STONITH pacemaker was not started after reboot.
> In logs I see during boot
Using a two-node cluster with a single shared disk can be
dangerous with sbd before 1.3.1. If the pacemaker-watcher is
enabled, losing the virtual disk makes the node fall back to
quorum, which doesn't tell you much in a two-node cluster.
So your disk may become a single point of failure. Even worse,
you can get corruption if one side loses the disk: the side
that is still able to write to the disk will think it has fenced
the other, while the other doesn't see the poison pill but is
still happy having quorum due to the two-node feature of
corosync.
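
To make that scenario concrete, here is a toy model in Python. It is
only a sketch of the behaviour as I described it above, not actual sbd
logic, and the function and parameter names are made up for
illustration.

```python
def node_survives(sees_disk, poison_pill_in_slot, has_quorum):
    """Toy model of one node running a disk watcher plus the pacemaker watcher.

    Hypothetical simplification of the pre-1.3.1 behaviour described above.
    """
    if sees_disk:
        # The node can read its slot, so it obeys a poison pill if one is there.
        return not poison_pill_in_slot
    # Disk lost: the node falls back to quorum alone.
    return has_quorum


# Node A writes a poison pill into B's slot, but B has lost the disk.
# With the two-node feature B still has quorum, so B keeps running
# although A believes B has been fenced.
b_keeps_running = node_survives(
    sees_disk=False, poison_pill_in_slot=True, has_quorum=True
)
```

In this model `b_keeps_running` is true, which is exactly the case
where both sides may end up touching the same data.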
>
> Nov 22 16:04:56 sapprod01s crmd[3151]:     crit: We were allegedly
> just fenced by sapprod01p for sapprod01p
> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
> process (3151) can no longer be respawned,
> Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down Pacemaker
>
> SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
> stonith with SBD always takes msgwait (at least, visually host is not
> declared as OFFLINE until 120s passed). But VM reboots lightning fast
> and is up and running long before timeout expires.
>
> I think I have seen similar report already. Is it something that can
> be fixed by SBD/pacemaker tuning?
I don't know this from sbd, but I have seen fencing with the
cycle method lead to strange behavior on machines that boot
quickly.
If you configure sbd not to clear its disk slot on startup
(SBD_STARTMODE=clean), clearing is left to the other side.
That should prevent the fenced node from coming up while
the fencing node is still waiting. You might also switch the
method from cycle to off/on to make the fencing side clean
the slot.
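
A rough sketch of the race and of what the start mode changes, again as
a toy Python model rather than real sbd code. The 120s msgwait is from
your report. The 30s boot time, the slot message names ("clear",
"reset") and the "always" start mode are my assumptions about a typical
setup.

```python
MSGWAIT = 120  # seconds, msgwait from the original report


def rejoins_before_fencing_completes(boot_seconds, msgwait=MSGWAIT):
    """True if the fenced node is back up while the fencer is still waiting."""
    return boot_seconds < msgwait


def may_start_sbd(slot_message, start_mode):
    """Toy model of the start-up decision (hypothetical names).

    'always': the node clears its own slot and starts regardless.
    'clean':  the node starts only if its slot holds no pending fence message.
    """
    if start_mode == "always":
        return True
    return slot_message == "clear"


# A VM that boots in an assumed 30s is back long before msgwait expires.
race = rejoins_before_fencing_completes(30)

# With 'clean', the rebooted node stays down until the fencing side
# has cleared the slot.
blocked = not may_start_sbd("reset", "clean")
```

With these numbers both `race` and `blocked` are true: the node would
come back too early, and the clean start mode is what holds it back.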

>
> I can provide full logs tomorrow if needed.
Yes, it would be interesting to see more ...

If what I'm writing doesn't make much sense to you,
that may be because I don't really know how sbd is
configured on SLES ;-)

Regards,
Klaus
>
> TIA
>
> -andrei
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org