[ClusterLabs] question about dc-deadtime
Ken Gaillot
kgaillot at redhat.com
Tue Jan 10 13:41:04 EST 2017
On 01/10/2017 11:59 AM, Chris Walker wrote:
>
>
> On Mon, Jan 9, 2017 at 6:55 PM, Andrew Beekhof <abeekhof at redhat.com> wrote:
>
> On Fri, Dec 16, 2016 at 8:52 AM, Chris Walker
> <christopher.walker at gmail.com> wrote:
> > Thanks for your response, Ken. I'm puzzled ... in my case nodes remain
> > UNCLEAN (offline) until dc-deadtime expires, even when both nodes are up and
> > corosync is quorate.
>
> I'm guessing you're starting both nodes at the same time?
>
>
> The nodes power on at the same time, but hardware discovery can vary by
> minutes.
>
>
>
> The behaviour you're seeing is arguably a hangover from the multicast
> days (in which case corosync wouldn't have had a node list).
>
>
> That makes sense.
>
>
> But since that's not the common case anymore, we could probably
> shortcut the timeout if we know the complete node list and see that
> they are all online.
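(A toy sketch of that proposed shortcut, purely illustrative and not Pacemaker's actual code: finish the DC election early once every node in the configured node list has been seen, and otherwise fall back to the dc-deadtime timer.)

```python
# Toy model of the proposed shortcut -- illustrative only, not Pacemaker code.
# The election normally waits out dc-deadtime; the idea above is to finish
# early once every node in the known node list is already online.
def election_should_finish(known_nodes, seen_nodes, elapsed_s, dc_deadtime_s):
    if known_nodes and known_nodes <= seen_nodes:
        return True  # complete node list is online: no reason to keep waiting
    return elapsed_s >= dc_deadtime_s  # otherwise honor the full timeout

# Both nodes of a two-node cluster seen after 3s: elect immediately.
print(election_should_finish({"max04", "max05"}, {"max04", "max05"}, 3, 120))  # True
# Only one node seen so far: keep waiting until dc-deadtime expires.
print(election_should_finish({"max04", "max05"}, {"max04"}, 3, 120))           # False
```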
>
>
> That would be ideal. It's easy enough to work around this in systemd,
> but it seems like the HA stack should be the authority on node status.
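(For reference, the systemd workaround mentioned is roughly a drop-in like the following; wait-for-peer.sh is a hypothetical site-specific script that polls until the peer node answers, and the paths/names are illustrative only:)

```ini
# /etc/systemd/system/pacemaker.service.d/wait-for-peer.conf
# Hypothetical drop-in: hold Pacemaker startup until the peer responds,
# so both nodes join corosync at roughly the same time.
[Service]
ExecStartPre=/usr/local/bin/wait-for-peer.sh
TimeoutStartSec=10min
```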
I've opened a feature request:
http://bugs.clusterlabs.org/show_bug.cgi?id=5310
FYI the priority list is long at this point, so no idea when it might be
addressed.
> Thanks!
> Chris
>
>
> >
> > I see the following from crmd when I have dc-deadtime=2min
> >
> > Dec 15 21:34:33 max04 crmd[13791]: notice: Quorum acquired
> > Dec 15 21:34:33 max04 crmd[13791]: notice: pcmk_quorum_notification: Node max04[2886730248] - state is now member (was (null))
> > Dec 15 21:34:33 max04 crmd[13791]: notice: pcmk_quorum_notification: Node (null)[2886730249] - state is now member (was (null))
> > Dec 15 21:34:33 max04 crmd[13791]: notice: Notifications disabled
> > Dec 15 21:34:33 max04 crmd[13791]: notice: The local CRM is operational
> > Dec 15 21:34:33 max04 crmd[13791]: notice: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_started ]
> > ...
> > Dec 15 21:36:33 max05 crmd[10365]: warning: FSA: Input I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> > Dec 15 21:36:33 max05 crmd[10365]: notice: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_TIMER_POPPED origin=election_timeout_popped ]
> > Dec 15 21:36:33 max05 crmd[10365]: warning: FSA: Input I_ELECTION_DC from do_election_check() received in state S_INTEGRATION
> > Dec 15 21:36:33 max05 crmd[10365]: notice: Notifications disabled
> > Dec 15 21:36:33 max04 crmd[13791]: notice: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> >
> > only after this do the nodes transition to Online. This is using the
> > vanilla RHEL7.2 cluster stack and the following options:
> >
> > property cib-bootstrap-options: \
> > no-quorum-policy=ignore \
> > default-action-timeout=120s \
> > pe-warn-series-max=1500 \
> > pe-input-series-max=1500 \
> > pe-error-series-max=1500 \
> > stonith-action=poweroff \
> > stonith-timeout=900 \
> > dc-deadtime=2min \
> > maintenance-mode=false \
> > have-watchdog=false \
> > dc-version=1.1.13-10.el7-44eb2dd \
> > cluster-infrastructure=corosync
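(For anyone following along on the RHEL 7 stack, the non-default options above can equivalently be set one at a time with pcs; a sketch, shown only for the properties relevant to this thread:)

```shell
# Sketch: setting the relevant cluster properties via pcs (RHEL 7 stack).
pcs property set dc-deadtime=2min
pcs property set no-quorum-policy=ignore
pcs property set stonith-action=poweroff
pcs property set stonith-timeout=900
```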
> >
> > Thanks again,
> > Chris
> >
> > On Thu, Dec 15, 2016 at 3:26 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> >>
> >> On 12/15/2016 02:00 PM, Chris Walker wrote:
> >> > Hello,
> >> >
> >> > I have a quick question about dc-deadtime. I believe that Digimer and
> >> > others on this list might have already addressed this, but I want to
> >> > make sure I'm not missing something.
> >> >
> >> > If my understanding is correct, dc-deadtime sets the amount of time that
> >> > must elapse before a cluster is formed (DC is elected, etc.), regardless
> >> > of which nodes have joined the cluster. In other words, even if all
> >> > nodes that are explicitly enumerated in the nodelist section have
> >> > started Pacemaker, they will still wait dc-deadtime before forming a
> >> > cluster.
> >> >
> >> > In my case, I have a two-node cluster on which I'd like to allow a
> >> > pretty long time (~5 minutes) for both nodes to join before giving up on
> >> > them. However, if they both join quickly, I'd like to proceed to form a
> >> > cluster immediately; I don't want to wait for the full five minutes to
> >> > elapse before forming a cluster. Further, if a node doesn't respond
> >> > within five minutes, I want to fence it and start resources on the node
> >> > that is up.
> >>
> >> Pacemaker+corosync behaves as you describe by default.
> >>
> >> dc-deadtime is how long to wait for an election to finish, but if the
> >> election finishes sooner than that (i.e. a DC is elected), it stops
> >> waiting. It doesn't even wait for all nodes, just a quorum.
> >>
> >> Also, with startup-fencing=true (the default), any unseen nodes will be
> >> fenced, and the remaining nodes will proceed to host resources. Of
> >> course, it needs quorum for this, too.
> >>
> >> With two nodes, quorum is handled specially, but that's a different topic.
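(Concretely, the special two-node handling referred to here is enabled in corosync.conf via votequorum's two_node flag; a typical fragment, noting that two_node: 1 also implies wait_for_all at startup:)

```conf
quorum {
    provider: corosync_votequorum
    # two_node: 1 tells votequorum this is a two-node cluster: the cluster
    # keeps quorum when one node dies, and wait_for_all is implied, so the
    # cluster only becomes quorate once both nodes have been seen.
    two_node: 1
}
```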
> >>
> >> > With Pacemaker/Heartbeat, the initdead parameter did exactly what I
> >> > want, but I don't see any way to do this with Pacemaker/Corosync. From
> >> > reading other posts, it looks like people use an external agent to start
> >> > HA daemons once nodes are up ... is this a correct understanding?
> >> >
> >> > Thanks very much,
> >> > Chris