[ClusterLabs] Node add doesn't add node?
Israel Brewster
ibrewster at flyravn.com
Thu Jan 17 12:53:19 EST 2019
On Jan 11, 2019, at 3:53 AM, Jan Pokorný <jpokorny at redhat.com> wrote:
On 11/01/19 00:16 +0000, Israel Brewster wrote:
On Jan 10, 2019, at 10:57 AM, Israel Brewster <ibrewster at flyravn.com> wrote:
So in my ongoing work to upgrade my cluster to CentOS 7, I got one
box up and running on CentOS 7, with the cluster fully configured
and functional, and moved all my services over to it. Now I'm trying
to add a second node, following the directions here:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-clusternodemanage-haar#s2-nodeadd-HAAR
However, it doesn't appear to be working. The existing node is named
"follow3", and the new node I am trying to add is named "follow1":
- The auth command run from follow3 returns "follow1: Authorized", so that looks good.
- The "pcs cluster node add follow1" command, again run on follow3, gives the following output:
Disabling SBD service...
follow1: sbd disabled
Sending remote node configuration files to 'follow1'
follow1: successful distribution of the file 'pacemaker_remote authkey'
follow3: Corosync updated
Setting up corosync...
follow1: Succeeded
Synchronizing pcsd certificates on nodes follow1...
follow1: Success
Restarting pcsd on the nodes in order to reload the certificates...
follow1: Success
...So it would appear that worked as well. I then issued the
"pcs cluster start --all" command, which gave the following output:
[root at follow3 ~]# pcs cluster start --all
follow3: Starting Cluster (corosync)...
follow1: Starting Cluster (corosync)...
follow3: Starting Cluster (pacemaker)...
follow1: Starting Cluster (pacemaker)...
So again, everything looks good (to me). However, when I run "pcs
status" on the existing node, I get the following:
[root at follow3 ~]# pcs status
Cluster name: follow
Stack: corosync
Current DC: follow3 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with quorum
Last updated: Thu Jan 10 10:47:33 2019
Last change: Wed Jan 9 21:39:37 2019 by root via cibadmin on follow3
1 node configured
29 resources configured
Online: [ follow3 ]
Full list of resources:
which would seem to indicate that it doesn't know about the node I
just added (follow1). Meanwhile, follow1 "pcs status" shows this:
[root at follow1 ~]# pcs status
Cluster name: follow
Stack: corosync
Current DC: follow1 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition WITHOUT quorum
Last updated: Thu Jan 10 10:54:25 2019
Last change: Thu Jan 10 10:54:13 2019 by root via cibadmin on follow1
2 nodes configured
0 resources configured
Online: [ follow1 ]
OFFLINE: [ follow3 ]
No resources
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
So it got at least *some* of the config, but apparently not the full
thing (no resources), and it shows follow3 as offline, even though
it is online and reachable. Oddly, "pcs cluster status" shows pcsd as
online on both follow1 and follow3. What am I missing here?
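For reference, the commands run from follow3 were essentially the
following (the -u option to auth is the usual one; the exact
invocation may have varied slightly):

[root at follow3 ~]# pcs cluster auth follow1 -u hacluster
follow1: Authorized
[root at follow3 ~]# pcs cluster node add follow1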
As a follow-up to the above: restarting corosync on the functioning
node (follow3) at least allows the second node (follow1) to show up
when I do a pcs status. However, the second node still shows as
OFFLINE (and follow3 shows as offline on follow1), and follow1 is
still missing pretty much all of the config. If I try to remove and
re-add follow1, the removal works as expected (the node count on
follow3 drops to 1), but the add behaves exactly the same as before,
with pcs status not acknowledging the added node.
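(The restart in question was just a plain "systemctl restart corosync"
on follow3, followed by re-running "pcs status" on each node.)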
What do the logs on follow1 have to say about this?
E.g. journalctl -b --no-hostname -u corosync -u pacemaker, focusing
on the suspect time window.
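For instance, something along these lines (substitute the actual
window of the node add attempt):

journalctl -u corosync -u pacemaker --since "2019-01-10 10:40" --until "2019-01-10 11:00"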
If there's nothing sufficiently explaining what actually happened,
you can still review the underlying pcs communication itself if you
pass --debug to it.
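That is, for instance:

pcs cluster node remove follow1
pcs cluster node add follow1 --debug

which prints the individual requests pcs sends to each node and the
responses it receives.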
I suspect that simply one corosync instance doesn't see the other
for whatever reason (firewall, bad addresses or nodes not on the same
network at all, node addresses out of sync between the copies of
corosync.conf on the two machines, or possibly even stale entries in
/etc/hosts or DNS, ...).
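A few quick checks worth running on *both* nodes (the firewalld
service name below assumes a stock CentOS 7 setup):

# does corosync itself currently see both members?
corosync-quorumtool -s

# do the node addresses in corosync.conf match on both machines?
grep ring0_addr /etc/corosync/corosync.conf

# is cluster traffic allowed through the firewall?
firewall-cmd --list-services          # should include high-availability
# if it is missing:
firewall-cmd --permanent --add-service=high-availability && firewall-cmd --reload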
So apparently something was messed up on follow3, although I don't know what. I ended up doing the following, which worked:
1) Set up a new VM ('follow4')
2) Cluster it with follow1
3) Dump just the resources and constraints from follow3
4) Load the resulting .xml files into the new cluster (follow1 and follow4); a rough sketch of steps 3 and 4 follows below
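One way to do the dump and load in steps 3 and 4 is with cibadmin;
it was roughly along these lines (file names are just illustrative):

[root at follow3 ~]# cibadmin --query --scope resources > resources.xml
[root at follow3 ~]# cibadmin --query --scope constraints > constraints.xml
(copy the .xml files over to the new cluster, then:)
[root at follow1 ~]# cibadmin --replace --scope resources --xml-file resources.xml
[root at follow1 ~]# cibadmin --replace --scope constraints --xml-file constraints.xml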
Once I did the above, I was able to add an additional node (follow2) to the new follow1/follow4 cluster with no problems. So while I don't know what was going on with follow3, at least I now have a properly functioning cluster again!
_______________________________________________
Users mailing list: Users at clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org