[ClusterLabs] Can a two node cluster start resources if only one node is booted?
Klaus Wenninger
kwenning at redhat.com
Fri Apr 22 04:06:22 EDT 2022
On Thu, Apr 21, 2022 at 8:18 PM john tillman <johnt at panix.com> wrote:
>
> > On 21.04.2022 18:26, john tillman wrote:
> >>> Dne 20. 04. 22 v 20:21 john tillman napsal(a):
> >>>>> On 20.04.2022 19:53, john tillman wrote:
> >>>>>> I have a two-node cluster that won't start any resources if
> >>>>>> only one node is booted; the pacemaker service does not start.
> >>>>>>
> >>>>>> Once the second node boots up, the first node will start
> >>>>>> pacemaker and the resources are started. All is well. But I
> >>>>>> would like the resources to start when the first node boots by
> >>>>>> itself.
> >>>>>>
> >>>>>> I thought the problem was with the wait_for_all option, but I
> >>>>>> have it set to "0".
> >>>>>>
> >>>>>> On the node that is booted by itself, when I run
> >>>>>> "corosync-quorumtool" I see:
> >>>>>>
> >>>>>> [root@test00 ~]# corosync-quorumtool
> >>>>>> Quorum information
> >>>>>> ------------------
> >>>>>> Date:             Wed Apr 20 16:05:07 2022
> >>>>>> Quorum provider:  corosync_votequorum
> >>>>>> Nodes:            1
> >>>>>> Node ID:          1
> >>>>>> Ring ID:          1.2f
> >>>>>> Quorate:          Yes
> >>>>>>
> >>>>>> Votequorum information
> >>>>>> ----------------------
> >>>>>> Expected votes:   2
> >>>>>> Highest expected: 2
> >>>>>> Total votes:      1
> >>>>>> Quorum:           1
> >>>>>> Flags:            2Node Quorate
> >>>>>>
> >>>>>> Membership information
> >>>>>> ----------------------
> >>>>>>     Nodeid      Votes Name
> >>>>>>          1          1 test00 (local)
> >>>>>>
> >>>>>>
> >>>>>> My config file looks like this:
> >>>>>> totem {
> >>>>>>     version: 2
> >>>>>>     cluster_name: testha
> >>>>>>     transport: knet
> >>>>>>     crypto_cipher: aes256
> >>>>>>     crypto_hash: sha256
> >>>>>> }
> >>>>>>
> >>>>>> nodelist {
> >>>>>>     node {
> >>>>>>         ring0_addr: test00
> >>>>>>         name: test00
> >>>>>>         nodeid: 1
> >>>>>>     }
> >>>>>>
> >>>>>>     node {
> >>>>>>         ring0_addr: test01
> >>>>>>         name: test01
> >>>>>>         nodeid: 2
> >>>>>>     }
> >>>>>> }
> >>>>>>
> >>>>>> quorum {
> >>>>>>     provider: corosync_votequorum
> >>>>>>     two_node: 1
> >>>>>>     wait_for_all: 0
> >>>>>> }
> >>>>>>
> >>>>>> logging {
> >>>>>>     to_logfile: yes
> >>>>>>     logfile: /var/log/cluster/corosync.log
> >>>>>>     to_syslog: yes
> >>>>>>     timestamp: on
> >>>>>>     debug: on
> >>>>>>     syslog_priority: debug
> >>>>>>     logfile_priority: debug
> >>>>>> }
> >>>>>>
> >>>>>> Fencing is disabled.
> >>>>>>
> >>>>>
> >>>>> That won't work.
> >>>>>
> >>>>>> I've also looked in "corosync.log" but I don't know what to
> >>>>>> look for to diagnose this issue. I mean there are many lines
> >>>>>> similar to:
> >>>>>> [QUORUM] This node is within the primary component and will
> >>>>>> provide service.
> >>>>>> and
> >>>>>> [VOTEQ ] Sending quorum callback, quorate = 1
> >>>>>> and
> >>>>>> [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First:
> >>>>>> Yes Qdevice: No QdeviceAlive: No QdeviceCastVote: No
> >>>>>> QdeviceMasterWins: No
> >>>>>>
> >>>>>> Is there something specific I should look for in the log?
> >>>>>>
> >>>>>> So can a two-node cluster work after booting only one node?
> >>>>>> Maybe it never will and I am wasting a lot of time, yours and
> >>>>>> mine.
> >>>>>>
> >>>>>> If it can, what else can I investigate further?
> >>>>>>
> >>>>>
> >>>>> Before a node can start handling resources, it needs to know
> >>>>> the status of the other node. Without successful fencing there
> >>>>> is no way to accomplish that.
> >>>>>
> >>>>> Yes, you can tell pacemaker to ignore the unknown status.
> >>>>> Depending on your resources, this could merely prevent normal
> >>>>> operation or it could lead to data corruption.
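The knob alluded to here is most likely Pacemaker's startup-fencing
cluster property. A minimal sketch of relaxing it with pcs, carrying
exactly the risks described above:

    # Assume nodes the cluster has never seen are safely down
    # (default is true; disabling it trades safety for availability)
    pcs property set startup-fencing=false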
> >>>>
> >>>>
> >>>> Makes sense. Thank you.
> >>>>
> >>>> Perhaps some future enhancement could allow for this situation?
> >>>> I mean, it might be desirable in some cases to allow a single
> >>>> node to boot, determine quorum via two_node=1 and wait_for_all=0,
> >>>> and start resources without ever seeing the other node. Sure,
> >>>> there are dangers of split brain, but I can see special cases
> >>>> where I want the node to work alone for a period of time despite
> >>>> the danger.
> >>>>
> >>>
> >>> Hi John,
> >>>
> >>> How about 'pcs quorum unblock'?
> >>>
> >>> Regards,
> >>> Tomas
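For context, the usual sequence would be something like this, assuming
the cluster stack is at least partially up on the surviving node:

    # Check what the cluster currently thinks about quorum
    pcs quorum status

    # Stop waiting for nodes that cannot be reached
    pcs quorum unblock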
> >>>
> >>
> >>
> >> Tomas,
> >>
> >> Thank you for the suggestion. However, it didn't work. It
> >> returned:
> >>     Error: unable to check quorum status
> >>     crm_mon: Error: cluster is not available on this node
> >> I checked pacemaker, just in case, and it still isn't running.
> >>
> >
> > Either pacemaker or some service it depends on attempted to start
> > and failed, or systemd is still waiting for some service that is
> > required before pacemaker. Check the logs or provide "journalctl -b"
> > output in this state.
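A minimal way to capture that state, assuming systemd-managed corosync
and pacemaker units:

    # This boot's log entries for both cluster services
    journalctl -b -u corosync -u pacemaker

    # Whether the units are active, failed, or still waiting
    systemctl status corosync pacemaker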
> >
> >
>
>
> I looked at pacemaker's log and it does not have any updates since
> the system was shut down. When we booted the node, if it had started
> and failed, or started and was stopped by systemd, there would be
> something in this log, no?
>
> journalctl -b is lengthy and I'd rather not attach it here, but I
> grep'd through it and I can't find any pacemaker references. No
> errors are reported from systemd.
>
> Once the other node is started, something starts the pacemaker
> service. The pacemaker log starts filling up, journalctl -b sees
> plenty of pacemaker entries, and crm_mon and pcs status work
> correctly, showing the cluster in a good state with all resources
> started properly.
>
> So I don't see anything stopping pacemaker from starting at boot. It
> looks like some piece of cluster software is starting it once the
> second node is online. Maybe corosync? Although the corosync log
> doesn't mention the start of anything; all it logs is seeing the
> second node join.
>
> So what starts pacemaker in this case?
Definitely a good question!
With the systemd unit files as usually distributed, this behavior is
hard to explain. So the first thing I would check is the pacemaker
unit file (check all locations that might override the one shipped
with pacemaker). Maybe somebody tried to implement something similar
to wait-for-all by checking whether corosync reports the cluster
partition as quorate before starting pacemaker.
But then I would actually expect 'journalctl -u pacemaker' to show
some sign of the unit starting. (It should of course show up in
'journalctl -b' as well.)
Klaus
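A quick way to audit which unit file is actually in effect, assuming a
systemd-based distribution (paths may vary):

    # Print the effective unit file, including any drop-in overrides
    systemctl cat pacemaker.service

    # Look for local files shadowing the packaged unit
    ls /etc/systemd/system/pacemaker.service.d/ 2>/dev/null
    systemd-delta --type=overridden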
>
> Thank you for the response.
>
> -John
>
>
> >> I'm very curious how I could convince the cluster to start its
> >> resources on one node in the event that the other node is not able
> >> to boot. But I'm afraid the answer is either to use fencing, or to
> >> add a third node to the cluster, or both (see the quorum-device
> >> sketch below).
> >>
> >> -John
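If a full third node is too heavy, a quorum device is the usual middle
ground: a small arbiter daemon (corosync-qnetd) on a third machine
that gives the surviving node the tie-breaking vote. A rough sketch
with pcs, where the host name 'arbiter' is a placeholder:

    # On the cluster nodes, after corosync-qnetd is running on 'arbiter':
    pcs quorum device add model net host=arbiter algorithm=ffsplit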
> >>
> >>
> >>>> Thank you again.