[ClusterLabs] Can a two node cluster start resources if only one node is booted?
Klaus Wenninger
kwenning at redhat.com
Fri Apr 22 04:06:22 EDT 2022
On Thu, Apr 21, 2022 at 8:18 PM john tillman <johnt at panix.com> wrote:
>
> > On 21.04.2022 18:26, john tillman wrote:
> >>> Dne 20. 04. 22 v 20:21 john tillman napsal(a):
> >>>>> On 20.04.2022 19:53, john tillman wrote:
> >>>>>> I have a two-node cluster that won't start any resources if
> >>>>>> only one node is booted; the pacemaker service does not start.
> >>>>>>
> >>>>>> Once the second node boots up, the first node will start
> >>>>>> pacemaker and the resources are started. All is well. But I
> >>>>>> would like the resources to start when the first node boots by
> >>>>>> itself.
> >>>>>>
> >>>>>> I thought the problem was with the wait_for_all option, but I
> >>>>>> have it set to "0".
> >>>>>>
> >>>>>> On the node that is booted by itself, when I run
> >>>>>> "corosync-quorumtool" I see:
> >>>>>>
> >>>>>> [root@test00 ~]# corosync-quorumtool
> >>>>>> Quorum information
> >>>>>> ------------------
> >>>>>> Date:             Wed Apr 20 16:05:07 2022
> >>>>>> Quorum provider:  corosync_votequorum
> >>>>>> Nodes:            1
> >>>>>> Node ID:          1
> >>>>>> Ring ID:          1.2f
> >>>>>> Quorate:          Yes
> >>>>>>
> >>>>>> Votequorum information
> >>>>>> ----------------------
> >>>>>> Expected votes:   2
> >>>>>> Highest expected: 2
> >>>>>> Total votes:      1
> >>>>>> Quorum:           1
> >>>>>> Flags:            2Node Quorate
> >>>>>>
> >>>>>> Membership information
> >>>>>> ----------------------
> >>>>>>     Nodeid      Votes Name
> >>>>>>          1          1 test00 (local)
> >>>>>>
> >>>>>>
> >>>>>> My config file looks like this:
> >>>>>> totem {
> >>>>>>     version: 2
> >>>>>>     cluster_name: testha
> >>>>>>     transport: knet
> >>>>>>     crypto_cipher: aes256
> >>>>>>     crypto_hash: sha256
> >>>>>> }
> >>>>>>
> >>>>>> nodelist {
> >>>>>>     node {
> >>>>>>         ring0_addr: test00
> >>>>>>         name: test00
> >>>>>>         nodeid: 1
> >>>>>>     }
> >>>>>>
> >>>>>>     node {
> >>>>>>         ring0_addr: test01
> >>>>>>         name: test01
> >>>>>>         nodeid: 2
> >>>>>>     }
> >>>>>> }
> >>>>>>
> >>>>>> quorum {
> >>>>>>     provider: corosync_votequorum
> >>>>>>     two_node: 1
> >>>>>>     wait_for_all: 0
> >>>>>> }
> >>>>>>
> >>>>>> logging {
> >>>>>>     to_logfile: yes
> >>>>>>     logfile: /var/log/cluster/corosync.log
> >>>>>>     to_syslog: yes
> >>>>>>     timestamp: on
> >>>>>>     debug: on
> >>>>>>     syslog_priority: debug
> >>>>>>     logfile_priority: debug
> >>>>>> }
> >>>>>>
> >>>>>> Fencing is disabled.
> >>>>>>
> >>>>>
> >>>>> That won't work.
> >>>>>
> >>>>>> I've also looked in "corosync.log" but I don't know what to
> >>>>>> look for to diagnose this issue. I mean there are many lines
> >>>>>> similar to:
> >>>>>> [QUORUM] This node is within the primary component and will
> >>>>>> provide service.
> >>>>>> and
> >>>>>> [VOTEQ ] Sending quorum callback, quorate = 1
> >>>>>> and
> >>>>>> [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First:
> >>>>>> Yes Qdevice: No QdeviceAlive: No QdeviceCastVote: No
> >>>>>> QdeviceMasterWins: No
> >>>>>>
> >>>>>> Is there something specific I should look for in the log?
> >>>>>>
> >>>>>> So can a two-node cluster work after booting only one node?
> >>>>>> Maybe it never will and I am wasting a lot of time, yours and
> >>>>>> mine.
> >>>>>>
> >>>>>> If it can, what else can I investigate further?
> >>>>>>
> >>>>>
> >>>>> Before a node can start handling resources, it needs to know
> >>>>> the status of the other node. Without successful fencing there
> >>>>> is no way to accomplish that.
> >>>>>
> >>>>> Yes, you can tell pacemaker to ignore the unknown status.
> >>>>> Depending on your resources, this could merely prevent normal
> >>>>> operation or it could lead to data corruption.
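The knob alluded to here is most likely Pacemaker's startup-fencing
cluster property. A minimal sketch of relaxing it with pcs, carrying
exactly the risks described above:

    # Assume nodes the cluster has never seen are safely down
    # (default is true; disabling it trades safety for availability)
    pcs property set startup-fencing=false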
> >>>>
> >>>>
> >>>> Makes sense. Thank you.
> >>>>
> >>>> Perhaps some future enhancement could allow for this situation?
> >>>> I mean, it might be desirable in some cases to allow a single
> >>>> node to boot, determine quorum via two_node=1 and wait_for_all=0,
> >>>> and start resources without ever seeing the other node. Sure,
> >>>> there are dangers of split brain, but I can see special cases
> >>>> where I want the node to work alone for a period of time despite
> >>>> the danger.
> >>>>
> >>>
> >>> Hi John,
> >>>
> >>> How about 'pcs quorum unblock'?
> >>>
> >>> Regards,
> >>> Tomas
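For context, the usual sequence would be something like this, assuming
the cluster stack is at least partially up on the surviving node:

    # Check what the cluster currently thinks about quorum
    pcs quorum status

    # Stop waiting for nodes that cannot be reached
    pcs quorum unblock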
> >>>
> >>
> >>
> >> Tomas,
> >>
> >> Thank you for the suggestion. However, it didn't work. It
> >> returned:
> >>     Error: unable to check quorum status
> >>     crm_mon: Error: cluster is not available on this node
> >> I checked pacemaker, just in case, and it still isn't running.
> >>
> >
> > Either pacemaker or some service it depends on attempted to start
> > and failed, or systemd is still waiting for some service that is
> > required before pacemaker. Check the logs or provide "journalctl -b"
> > output in this state.
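A minimal way to capture that state, assuming systemd-managed corosync
and pacemaker units:

    # This boot's log entries for both cluster services
    journalctl -b -u corosync -u pacemaker

    # Whether the units are active, failed, or still waiting
    systemctl status corosync pacemaker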
> >
> >
>
>
> I looked at pacemaker's log and it does not have any updates since
> the system was shut down. When we booted the node, if it had started
> and failed, or started and was stopped by systemd, there would be
> something in this log, no?
>
> journalctl -b is lengthy and I'd rather not attach it here, but I
> grep'd through it and I can't find any pacemaker references. No
> errors are reported from systemd.
>
> Once the other node is started, something starts the pacemaker
> service. The pacemaker log starts filling up, journalctl -b sees
> plenty of pacemaker entries, and crm_mon and pcs status work
> correctly, showing the cluster in a good state with all resources
> started properly.
>
> So I don't see anything stopping pacemaker from starting at boot. It
> looks like some piece of cluster software is starting it once the
> second node is online. Maybe corosync? Although the corosync log
> doesn't mention the start of anything; all it logs is seeing the
> second node join.
>
> So what starts pacemaker in this case?
Definitely a good question!
With the systemd unit files as usually distributed, this behavior is
hard to explain. So the first thing I would check is the pacemaker
unit file (check all locations that might override the one shipped
with pacemaker). Maybe somebody tried to implement something similar
to wait-for-all by checking whether corosync reports the cluster
partition as quorate before starting pacemaker.
But then I would actually expect 'journalctl -u pacemaker' to show
some sign of the unit starting. (It should of course show up in
'journalctl -b' as well.)
Klaus
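A quick way to audit which unit file is actually in effect, assuming a
systemd-based distribution (paths may vary):

    # Print the effective unit file, including any drop-in overrides
    systemctl cat pacemaker.service

    # Look for local files shadowing the packaged unit
    ls /etc/systemd/system/pacemaker.service.d/ 2>/dev/null
    systemd-delta --type=overridden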
>
> Thank you for the response.
>
> -John
>
>
> >> I'm very curious how I could convince the cluster to start its
> >> resources on one node in the event that the other node is not able
> >> to boot. But I'm afraid the answer is either to use fencing, or to
> >> add a third node to the cluster, or both (see the quorum-device
> >> sketch below).
> >>
> >> -John
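If a full third node is too heavy, a quorum device is the usual middle
ground: a small arbiter daemon (corosync-qnetd) on a third machine
that gives the surviving node the tie-breaking vote. A rough sketch
with pcs, where the host name 'arbiter' is a placeholder:

    # On the cluster nodes, after corosync-qnetd is running on 'arbiter':
    pcs quorum device add model net host=arbiter algorithm=ffsplit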
> >>
> >>
> >>>> Thank you again.