[ClusterLabs] Cannot add a node with pcs

Tomas Jelinek tojeline at redhat.com
Tue Aug 2 08:23:49 EDT 2022


Hi Piotr,

Sorry for the delay. I'm not a pacemaker expert, so I don't really know 
how pacemaker behaves in various corner cases. Even if I were, it would 
be difficult to advise you, since you haven't posted which versions of 
pacemaker / corosync / pcs you are using.

In any case, the first thing you need to do is configure stonith. 
Properly configured and working stonith is required for a cluster to 
operate. There is no way around it.
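As a sketch only: on hosts with IPMI-capable management controllers, a 
stonith device could be configured along these lines. The fence_ipmilan 
agent, the addresses, and the credentials below are placeholders for 
illustration, not your actual setup; pick the fence agent that matches 
your hardware.

```shell
# Show fence agents available on this system
pcs stonith list

# Create a stonith device able to fence n1 (agent name, IP, and
# credentials here are hypothetical examples)
pcs stonith create fence-n1 fence_ipmilan \
    ip=10.0.0.101 username=admin password=secret \
    pcmk_host_list=n1

# Verify the device starts, then re-enable stonith cluster-wide
pcs stonith status
pcs property set stonith-enabled=true
```

Repeat the 'pcs stonith create' step for each node, so that every node 
can be fenced by some device.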


Regards,
Tomas


Dne 13. 07. 22 v 18:54 Piotr Szafarczyk napsal(a):
> Hi Tomas,
> 
> Thank you very much for the idea. I have played with stonith_admin 
> --unfence and --confirm. Whenever I try, pcs status shows my actions 
> under Failed Fencing Actions. I see this in the log file:
> 
> error: Unfencing of n2 by <anyone> failed: No such device
> 
> No surprise here, since I have not got any devices registered.
> 
> If fencing of n2 were the cause, I would expect pcs status to show it 
> as offline or unhealthy, but at least to show it. Instead I have got:
> 
>    * 2 nodes configured
> 
> Also I would expect node remove + node clear + node add to make n2 a 
> brand new node.
> 
> Here are parts of the log from when I remove n2 from the cluster:
> 
> No peers with id=0 and/or uname=n2 to purge from the membership cache
> Removing all n2 attributes for peer n3
> Removing all n2 attributes for peer n1
> Instructing peers to remove references to node n2/0
> Completed cib_delete operation for section status: OK
> 
> There is nothing in the log file when I add it.
> 
> If fencing is the cause, where should I look for what the cluster tries 
> to do?
> 
> Have you got any other suggestions on what to check?
> 
> Best regards,
> Piotr
> 
> On 12.07.2022 12:50, Tomas Jelinek wrote:
>> Hi Piotr,
>>
>> Based on the 'pcs cluster node add n2' and 'pcs config' outputs, pcs 
>> added the node to your cluster successfully; that is, the corosync 
>> config has been modified, distributed, and loaded.
>>
>> It looks like the problem is with pacemaker. This is a wild guess, but 
>> maybe pacemaker wants to fence n2, which is not possible, as you 
>> disabled stonith. In the meantime, n1 and n3 do not allow n2 to join 
>> until it is confirmed fenced. Try looking into / posting 'pcs status 
>> --full' and the pacemaker log.
>>
>> With stonith disabled, you have a seemingly working cluster, until an 
>> event occurs which requires working stonith for the cluster to recover.
>>
>> Regards,
>> Tomas
>>
>>
>> Dne 12. 07. 22 v 12:34 Piotr Szafarczyk napsal(a):
>>> Hi,
>>>
>>> I used to have a working cluster with 3 nodes (and stonith disabled). 
>>> After an unexpected restart of one node, the cluster split. Node #2 
>>> started to see the others as unclean. Nodes #1 and #3 were 
>>> cooperating with each other, showing #2 as offline. There were no 
>>> network connection problems.
>>>
>>> I removed #2 (operating from #1) with
>>> pcs cluster node remove n2
>>>
>>> I verified that it had removed all configuration from #2, both for 
>>> corosync and for pacemaker. The cluster appears to be working 
>>> correctly with two nodes (and no traces of #2).
>>>
>>> Now I am trying to add the third node back.
>>> pcs cluster node add n2
>>> Disabling SBD service...
>>> n2: sbd disabled
>>> Sending 'corosync authkey', 'pacemaker authkey' to 'n2'
>>> n2: successful distribution of the file 'corosync authkey'
>>> n2: successful distribution of the file 'pacemaker authkey'
>>> Sending updated corosync.conf to nodes...
>>> n3: Succeeded
>>> n2: Succeeded
>>> n1: Succeeded
>>> n3: Corosync configuration reloaded
>>>
>>> I am able to start #2, operating from #1:
>>>
>>> pcs cluster pcsd-status
>>>    n2: Online
>>>    n3: Online
>>>    n1: Online
>>>
>>> pcs cluster enable n2
>>> pcs cluster start n2
>>>
>>> I can see that corosync's configuration has been updated, but 
>>> pacemaker's has not.
>>>
>>> _Checking from #1:_
>>>
>>> pcs config
>>> Cluster Name: n
>>> Corosync Nodes:
>>>   n1 n3 n2
>>> Pacemaker Nodes:
>>>   n1 n3
>>> [...]
>>>
>>> pcs status
>>>    * 2 nodes configured
>>> Node List:
>>>    * Online: [ n1 n3 ]
>>> [...]
>>>
>>> pcs cluster cib scope=nodes
>>> <nodes>
>>>    <node id="1" uname="n1"/>
>>>    <node id="3" uname="n3"/>
>>> </nodes>
>>>
>>> _#2 is seeing the state differently:_
>>>
>>> pcs config
>>> Cluster Name: n
>>> Corosync Nodes:
>>>   n1 n3 n2
>>> Pacemaker Nodes:
>>>   n1 n2 n3
>>>
>>> pcs status
>>>    * 3 nodes configured
>>> Node List:
>>>    * Online: [ n2 ]
>>>    * OFFLINE: [ n1 n3 ]
>>> Full List of Resources:
>>>    * No resources
>>> [...]
>>> (there are resources configured on #1 and #3)
>>>
>>> pcs cluster cib scope=nodes
>>> <nodes>
>>>    <node id="1" uname="n1"/>
>>>    <node id="3" uname="n3"/>
>>>    <node id="2" uname="n2"/>
>>> </nodes>
>>>
>>> Please help me diagnose this. Where should I look for the problem? 
>>> (I have already tried a few more things: I see nothing helpful in 
>>> the log files, pcs --debug shows nothing suspicious, and I have even 
>>> tried editing the CIB manually.)
>>>
>>> Best regards,
>>>
>>> Piotr Szafarczyk
>>>
>>>
>>> _______________________________________________
>>> Manage your subscription:
>>> https://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> ClusterLabs home: https://www.clusterlabs.org/
>>



More information about the Users mailing list