[ClusterLabs] Node add doesn't add node?

Fri Jan 11 07:53:10 EST 2019

On 11/01/19 00:16 +0000, Israel Brewster wrote:
> On Jan 10, 2019, at 10:57 AM, Israel Brewster <ibrewster at flyravn.com> wrote:
>> 
>> So in my ongoing work to upgrade my cluster to CentOS 7, I got one
>> box up and running on CentOS 7, with the cluster fully configured
>> and functional, and moved all my services over to it. Now I'm trying
>> to add a second node, following the directions here:
>> 
>> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-clusternodemanage-haar#s2-nodeadd-HAAR
>> 
>> However, it doesn't appear to be working. The existing node is named
>> "follow3", and the new node I am trying to add is named "follow1":
>> 
>> - The auth command run from follow3 returns "follow1: Authorized", so that looks good.
>> - The "pcs cluster node add follow1" command, again run on follow3, gives the following output:
>> 
>> Disabling SBD service...
>> follow1: sbd disabled
>> Sending remote node configuration files to 'follow1'
>> follow1: successful distribution of the file 'pacemaker_remote authkey'
>> follow3: Corosync updated
>> Setting up corosync...
>> follow1: Succeeded
>> Synchronizing pcsd certificates on nodes follow1...
>> follow1: Success
>> Restarting pcsd on the nodes in order to reload the certificates...
>> follow1: Success
>> 
>> ...So it would appear that that worked as well. I then issued the
>> "pcs cluster start --all" command, which gave the following output:
>> 
>> [root@follow3 ~]# pcs cluster start --all
>> follow3: Starting Cluster (corosync)...
>> follow1: Starting Cluster (corosync)...
>> follow3: Starting Cluster (pacemaker)...
>> follow1: Starting Cluster (pacemaker)...
>> 
>> So again, everything looks good (to me). However, when I run "pcs
>> status" on the existing node, I get the following:
>> 
>> [root@follow3 ~]# pcs status
>> Cluster name: follow
>> Stack: corosync
>> Current DC: follow3 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with quorum
>> Last updated: Thu Jan 10 10:47:33 2019
>> Last change: Wed Jan  9 21:39:37 2019 by root via cibadmin on follow3
>> 
>> 1 node configured
>> 29 resources configured
>> 
>> Online: [ follow3 ]
>> 
>> Full list of resources:
>> 
>> which would seem to indicate that it doesn't know about the node I
>> just added (follow1). Meanwhile, follow1 "pcs status" shows this:
>> 
>> [root@follow1 ~]# pcs status
>> Cluster name: follow
>> Stack: corosync
>> Current DC: follow1 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition WITHOUT quorum
>> Last updated: Thu Jan 10 10:54:25 2019
>> Last change: Thu Jan 10 10:54:13 2019 by root via cibadmin on follow1
>> 
>> 2 nodes configured
>> 0 resources configured
>> 
>> Online: [ follow1 ]
>> OFFLINE: [ follow3 ]
>> 
>> No resources
>> 
>> 
>> Daemon Status:
>>   corosync: active/disabled
>>   pacemaker: active/disabled
>>   pcsd: active/enabled
>> 
>> So it got at least *some* of the config, but apparently not the full
>> thing (no resources), and it shows follow3 as offline, even though
>> it is online and reachable. Oddly "pcs cluster status" shows both
>> follow1 and follow3 pcsd status as online. What am I missing here?
> 
> As a follow-up to the above, restarting corosync on the functioning
> node (follow3) at least allows the second node (follow1) to show up
> when I do a pcs status, however the second node still shows as
> OFFLINE (and follow3 shows as offline on follow1), and follow1 is
> still missing pretty much all of the config. If I try to remove and
> re-add follow1, the removal works as expected (node count on follow3
> drops to 1), but the add behaves exactly the same as before, with
> pcs status not acknowledging the added node.

What do the logs on follow1 say about this? For example, run
"journalctl -b --no-hostname -u corosync -u pacemaker" and focus
on the time around the failed node add.
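
To narrow the journal down to the suspect time, a small wrapper along
these lines may help. This is only a sketch: the cluster_logs name is
made up for illustration, and the time window has to be adjusted to
when the node add was actually attempted. --since and --until are
standard journalctl options.

```shell
# Show corosync and pacemaker messages from the current boot,
# limited to a time window around the failed "node add".
# $1 = start time, $2 = end time, in any format journalctl accepts.
# (cluster_logs is a hypothetical helper name, not part of any tool.)
cluster_logs() {
    journalctl -b --no-hostname -u corosync -u pacemaker \
        --since "$1" --until "$2"
}
```

For instance, cluster_logs "2019-01-10 10:40" "2019-01-10 11:00" would
cover the period of the pcs status outputs quoted above; the exact
window is an assumption.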

If nothing there sufficiently explains what actually happened,
you can still review the underlying pcs communication itself by
passing --debug to pcs.

I suspect that one corosync instance simply doesn't see the other
for some reason: a firewall, bad addresses, the nodes not being on
the same network at all, or addresses that are out of sync between
the nodes in corosync.conf, or possibly even in /etc/hosts or DNS.
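
One way to check for addresses being out of sync is to compare the
nodelist in corosync.conf on both nodes. The following is a minimal
sketch, assuming the nodes are listed via ring0_addr entries (as pcs
usually generates them on CentOS 7); the function names are made up
for illustration.

```shell
# Print the ring0_addr values from a corosync.conf, one per line, sorted.
# Assumes entries of the form "ring0_addr: <address>".
nodelist_addrs() {
    awk '$1 == "ring0_addr:" { print $2 }' "$1" | sort
}

# Compare the node addresses listed in two corosync.conf copies.
# Returns 0 when both files list the same set of addresses, 1 otherwise.
compare_nodelists() {
    a=$(nodelist_addrs "$1")
    b=$(nodelist_addrs "$2")
    if [ "$a" = "$b" ]; then
        echo "nodelists match"
    else
        echo "nodelists differ"
        return 1
    fi
}
```

After copying follow1's /etc/corosync/corosync.conf to follow3 (say to
/tmp/corosync.conf.follow1, a path chosen here only as an example),
"compare_nodelists /etc/corosync/corosync.conf /tmp/corosync.conf.follow1"
shows whether both nodes agree on the member addresses. It does not
check that those addresses resolve or are reachable.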

-- 
Nazdar,
Jan (Poki)