[ClusterLabs] Antw: Growing a cluster from 1 node without fencing

Klaus Wenninger kwenning at redhat.com
Mon Aug 14 10:30:38 EDT 2017


On 08/14/2017 03:12 PM, Edwin Török wrote:
> On 14/08/17 13:46, Klaus Wenninger wrote:
> > What does your /etc/sysconfig/sbd look like?
> > With just that pcs-command you get some default-config with
> > watchdog-only-support.
>
> It currently looks like this:
>
> SBD_DELAY_START=no
> SBD_OPTS="-n cluster1"
> SBD_PACEMAKER=yes
> SBD_STARTMODE=always
> SBD_WATCHDOG_DEV=/dev/watchdog
> SBD_WATCHDOG_TIMEOUT=5

Ok, no surprises there

>
> > Without the cluster property stonith-watchdog-timeout set to a
> > value matching (twice is a good choice) the watchdog-timeout
> > configured in /etc/sysconfig/sbd (default = 5s), a node will never
> > assume the unseen partner to be fenced.
> > Anyway, watchdog-only sbd is of very limited use in 2-node
> > scenarios. It limits availability to that of the node that
> > would win the tie-breaker. But it might still be useful
> > in certain scenarios of course (like load-sharing ...).
>
> Good point.

Still, the question remains why you didn't set stonith-watchdog-timeout ...
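
With SBD_WATCHDOG_TIMEOUT=5 as above that would be roughly (just a
sketch, the value follows the twice-the-watchdog-timeout rule of thumb):

  pcs property set stonith-watchdog-timeout=10s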

>
>> On 08/14/2017 12:20 PM, Ulrich Windl wrote:
>>> Hi!
>>>
>>> Have you tried studying the logs? Usually you get useful information
>>> from
>>> there (to share!).
>
> Here is journalctl and pacemaker.log output:
>
> Aug 14 08:57:26 cluster1 crmd[2221]:   notice: Result of start
> operation for dlm on cluster1: 0 (ok)
> Aug 14 08:57:26 cluster1 sbd[2202]:       pcmk:     info:
> set_servant_health: Node state: online
> Aug 14 08:57:26 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:26 cluster1 sbd[2199]:   notice: inquisitor_child:
> Servant pcmk is healthy (age: 0)
> Aug 14 08:57:26 cluster1 sbd[2199]:   notice: inquisitor_child: Active
> cluster detected
> Aug 14 08:57:26 cluster1 crmd[2221]:   notice: Initiating monitor
> operation dlm:0_monitor_30000 locally on cluster1
> Aug 14 08:57:26 cluster1 crmd[2221]:   notice: Transition 0
> (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-44.bz2): Complete
> Aug 14 08:57:26 cluster1 crmd[2221]:   notice: State transition
> S_TRANSITION_ENGINE -> S_IDLE
> Aug 14 08:57:27 cluster1 sbd[2203]:    cluster:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:27 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:28 cluster1 sbd[2203]:    cluster:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:28 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:28 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:29 cluster1 sbd[2203]:    cluster:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:29 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:30 cluster1 corosync[2208]:  [CFG   ] Config reload
> requested by node 1
> Aug 14 08:57:30 cluster1 corosync[2208]:  [TOTEM ] adding new UDPU
> member {10.71.77.147}
> Aug 14 08:57:30 cluster1 corosync[2208]:  [QUORUM] This node is within
> the non-primary component and will NOT provide any services.
> Aug 14 08:57:30 cluster1 corosync[2208]:  [QUORUM] Members[1]: 1
> Aug 14 08:57:30 cluster1 crmd[2221]:  warning: Quorum lost
> Aug 14 08:57:30 cluster1 pacemakerd[2215]:  warning: Quorum lost
>
> ^^^^^^^^^ Looks unexpected

I'm not that familiar with how corosync handles dynamic config changes.
Maybe you are on the losing side of the tie-breaker, or wait_for_all is
kicking in if it is configured.
It would be interesting to see how the two_node setting would handle that.
But two_node would of course break quorum-based fencing.
If you have a disk you could use as a shared disk for sbd, you could
achieve quorum-disk-like behavior. (Your package versions
look as if you are using RHEL 7.4.)
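
Roughly, in case it helps (just a sketch, the device path is a placeholder):

  # corosync.conf, quorum section
  quorum {
      provider: corosync_votequorum
      two_node: 1        # implies wait_for_all: 1 unless overridden
  }

  # /etc/sysconfig/sbd, add a shared disk
  SBD_DEVICE=/dev/disk/by-id/<shared-disk>

  # initialize the disk once before starting the cluster
  sbd -d /dev/disk/by-id/<shared-disk> create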

>
>
> Aug 14 08:57:30 cluster1 sbd[2202]:       pcmk:     info:
> set_servant_health: Quorum lost: Ignore
> Aug 14 08:57:30 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:30 cluster1 sbd[2203]:    cluster:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:30 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:31 cluster1 sbd[2203]:    cluster:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:31 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:32 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:32 cluster1 sbd[2203]:    cluster:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:32 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:33 cluster1 sbd[2203]:    cluster:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:33 cluster1 sbd[2199]:  warning: inquisitor_child:
> Servant pcmk is outdated (age: 4)
> Aug 14 08:57:33 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:34 cluster1 sbd[2203]:    cluster:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:34 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:35 cluster1 sbd[2203]:    cluster:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:35 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Not notifying parent: state transient (2)
> Aug 14 08:57:36 cluster1 sbd[2203]:    cluster:     info:
> notify_parent: Notifying parent: healthy
> Aug 14 08:57:36 cluster1 sbd[2199]:  warning: inquisitor_child:
> Latency: No liveness for 4 s exceeds threshold of 3 s (healthy
> servants: 0)
> Aug 14 08:57:36 cluster1 sbd[2202]:       pcmk:     info:
> notify_parent: Not notifying parent: state transient (2)
>

From sbd's point of view this is the expected behavior.
sbd handles ignore, stop & freeze exactly the same, by categorizing
the problem as something transient that might be overcome within
the watchdog-timeout.
In the case of suicide it would self-fence immediately.
Of course one might argue whether it would make sense not to handle
all 3 configurations the same in sbd - but that is how it is implemented
at the moment.
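
The ignore/stop/freeze/suicide above is pacemaker's no-quorum-policy,
which the pcmk servant of sbd picks up. For example:

  pcs property set no-quorum-policy=freeze   # or stop / ignore / suicide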

Regards,
Klaus

>
> Thanks,
> --Edwin
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
