[ClusterLabs] Cannot add a node with pcs

Piotr Szafarczyk piotr-l at netexpert.pl
Wed Jul 13 12:54:37 EDT 2022


Hi Tomas,

Thank you very much for the idea. I have played with stonith_admin 
--unfence and --confirm. Whenever I try, pcs status shows my actions 
under Failed Fencing Actions. I see this in the log file:

error: Unfencing of n2 by <anyone> failed: No such device

No surprise here, since I have not got any devices registered.
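
A minimal sketch of how such lines could be filtered out of the Pacemaker 
log for one node. The function name fencing_failures is made up for 
illustration, and the pattern is modelled only on the single error line 
above, so other Pacemaker versions may word the message differently.

```shell
# Hypothetical helper: print fencing/unfencing failures logged for one node.
# Usage: fencing_failures NODE < /path/to/pacemaker.log
fencing_failures() {
    # Case-insensitive so that both "Fencing of ..." and "Unfencing of ..."
    # match. The space after the node name keeps n2 from matching n20.
    grep -iE "fencing of $1 .*failed"
}
```

The log is read from standard input, so the same function works on a log 
file or on a saved excerpt.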

If fencing of n2 were the cause, I would expect pcs status to show it as 
offline or unhealthy, but at least to show it. Instead I have got:

   * 2 nodes configured

Also I would expect node remove + node clear + node add to make n2 a 
brand new node.
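
The mismatch can be checked mechanically from the 'pcs config' output 
quoted further down, where n2 is listed under Corosync Nodes but not under 
Pacemaker Nodes. This is only a rough sketch: the function name 
missing_from_pacemaker is made up, and it assumes the node names sit on 
the single line that follows each heading, as in that output.

```shell
# Hypothetical helper: list nodes that corosync knows but Pacemaker does not.
# Reads `pcs config` output on standard input.
missing_from_pacemaker() {
    awk '
        /^Corosync Nodes:/  { section = "c"; next }
        /^Pacemaker Nodes:/ { section = "p"; next }
        # Node names are expected on the line right after each heading.
        section == "c" { for (i = 1; i <= NF; i++) coro[$i] = 1; section = ""; next }
        section == "p" { for (i = 1; i <= NF; i++) pace[$i] = 1; section = ""; next }
        END { for (n in coro) if (!(n in pace)) print n }
    ' | sort
}
```

On a live node the input would come from pcs itself, for example 
pcs config | missing_from_pacemaker. With the output seen on n1 it would 
print n2, and with the output seen on n2 it would print nothing.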

Here are parts of the log from when I removed n2 from the cluster:

No peers with id=0 and/or uname=n2 to purge from the membership cache
Removing all n2 attributes for peer n3
Removing all n2 attributes for peer n1
Instructing peers to remove references to node n2/0
Completed cib_delete operation for section status: OK

There is nothing in the log file when I add it.

If fencing is the cause, where should I look for what the cluster tries 
to do?

Have you got any other suggestions on what to check?

Best regards,
Piotr

On 12.07.2022 12:50, Tomas Jelinek wrote:
> Hi Piotr,
>
> Based on the 'pcs cluster node add n2' and 'pcs config' outputs, pcs added 
> the node to your cluster successfully; that is, the corosync config has 
> been modified, distributed and loaded.
>
> It looks like the problem is with pacemaker. This is a wild guess, but 
> maybe pacemaker wants to fence n2, which is not possible, as you 
> disabled stonith. In the meantime, n1 and n3 do not allow n2 to join, 
> until it's confirmed fenced. Try looking into / posting 'pcs status 
> --full' and pacemaker log.
>
> With stonith disabled, you have a working cluster (seemingly). Until 
> you don't, due to an event which requires working stonith for the 
> cluster to recover.
>
> Regards,
> Tomas
>
>
> Dne 12. 07. 22 v 12:34 Piotr Szafarczyk napsal(a):
>> Hi,
>>
>> I used to have a working cluster with 3 nodes (and stonith disabled). 
>> After an unexpected restart of one node, the cluster split. The node 
>> #2 started to see the others as unclean. Nodes #1 and #3 were 
>> cooperating with each other, showing #2 as offline. There were no 
>> network connection problems.
>>
>> I removed #2 (operating from #1) with
>> pcs cluster node remove n2
>>
>> I verified that it had removed all configuration from #2, both for 
>> corosync and for pacemaker. The cluster appears to be working correctly 
>> with two nodes (and no traces of #2).
>>
>> Now I am trying to add the third node back.
>> pcs cluster node add n2
>> Disabling SBD service...
>> n2: sbd disabled
>> Sending 'corosync authkey', 'pacemaker authkey' to 'n2'
>> n2: successful distribution of the file 'corosync authkey'
>> n2: successful distribution of the file 'pacemaker authkey'
>> Sending updated corosync.conf to nodes...
>> n3: Succeeded
>> n2: Succeeded
>> n1: Succeeded
>> n3: Corosync configuration reloaded
>>
>> I am able to start #2 operating from #1
>>
>> pcs cluster pcsd-status
>>    n2: Online
>>    n3: Online
>>    n1: Online
>>
>> pcs cluster enable n2
>> pcs cluster start n2
>>
>> I can see that corosync's configuration has been updated, but 
>> pacemaker's has not.
>>
>> _Checking from #1:_
>>
>> pcs config
>> Cluster Name: n
>> Corosync Nodes:
>>   n1 n3 n2
>> Pacemaker Nodes:
>>   n1 n3
>> [...]
>>
>> pcs status
>>    * 2 nodes configured
>> Node List:
>>    * Online: [ n1 n3 ]
>> [...]
>>
>> pcs cluster cib scope=nodes
>> <nodes>
>>    <node id="1" uname="n1"/>
>>    <node id="3" uname="n3"/>
>> </nodes>
>>
>> _#2 is seeing the state differently:_
>>
>> pcs config
>> Cluster Name: n
>> Corosync Nodes:
>>   n1 n3 n2
>> Pacemaker Nodes:
>>   n1 n2 n3
>>
>> pcs status
>>    * 3 nodes configured
>> Node List:
>>    * Online: [ n2 ]
>>    * OFFLINE: [ n1 n3 ]
>> Full List of Resources:
>>    * No resources
>> [...]
>> (there are resources configured on #1 and #3)
>>
>> pcs cluster cib scope=nodes
>> <nodes>
>>    <node id="1" uname="n1"/>
>>    <node id="3" uname="n3"/>
>>    <node id="2" uname="n2"/>
>> </nodes>
>>
>> Please help me diagnose it. Where should I look for the problem? (I 
>> have already tried a few more things: I see nothing helpful in the log 
>> files, pcs --debug shows nothing suspicious, and I even tried editing 
>> the CIB manually.)
>>
>> Best regards,
>>
>> Piotr Szafarczyk
>>
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/