[ClusterLabs] why is node fenced ?

Ken Gaillot kgaillot at redhat.com
Wed Aug 14 13:07:39 EDT 2019


On Wed, 2019-08-14 at 11:57 +0200, Lentes, Bernd wrote:
> 
> ----- On Aug 13, 2019, at 1:19 AM, kgaillot kgaillot at redhat.com
> wrote:
> 
> 
> > 
> > The key messages are:
> > 
> > Aug 09 17:43:27 [6326] ha-idg-1       crmd:     info: crm_timer_popped: Election Trigger (I_DC_TIMEOUT) just popped (20000ms)
> > Aug 09 17:43:27 [6326] ha-idg-1       crmd:  warning: do_log: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
> > 
> > That indicates the newly rebooted node didn't hear from the other
> > node within 20s, and so assumed it was dead.
> > 
> > The new node had quorum, but never saw the other node's corosync,
> > so I'm guessing you have two_node and/or wait_for_all disabled in
> > corosync.conf, and/or you have no-quorum-policy=ignore in
> > pacemaker.
> > 
> > I'd recommend two_node: 1 in corosync.conf, with no explicit
> > wait_for_all or no-quorum-policy setting. That would ensure a
> > rebooted/restarted node doesn't get initial quorum until it has
> > seen the other node.
> 
> That's my setting:
> 
> expected_votes: 2
>       two_node: 1
>   wait_for_all: 0
> 
> no-quorum-policy=ignore
> 
> I did that because I want to be able to start the cluster even if one
> node has e.g. a hardware problem.
> Is that OK?

Well that's why you're seeing what you're seeing, which is also why
wait_for_all was created :)

You definitely don't need no-quorum-policy=ignore in any case. With
two_node, corosync will continue to provide quorum to pacemaker when
one node goes away, so from pacemaker's view no-quorum-policy never
kicks in.

With wait_for_all enabled, the newly joining node wouldn't get quorum
initially, so it wouldn't fence the other node. So that's the
trade-off: preventing this situation vs. being able to start one node
alone intentionally. Personally, I'd leave wait_for_all on normally,
and manually change it to 0 whenever I was intentionally taking one
node down for an extended time.
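
As a rough sketch of what that would look like in the quorum section of
corosync.conf (the provider line is an assumption about the rest of
your config, and expected_votes is carried over from your settings):

```
quorum {
    # Assumed: votequorum is the quorum provider
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    # No explicit wait_for_all: with two_node: 1 it defaults to on.
    # Only when deliberately running one node alone for a while:
    # wait_for_all: 0
}
```

The no-quorum-policy=ignore property would also be dropped on the
pacemaker side, so the default applies.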

Of course all of that is just recovery, and doesn't explain why the
nodes can't see each other to begin with.

> 
> 
> Bernd
-- 
Ken Gaillot <kgaillot at redhat.com>


