[ClusterLabs] Cannot add a node with pcs
Tomas Jelinek
tojeline at redhat.com
Tue Aug 2 08:23:49 EDT 2022
Hi Piotr,
Sorry for the delay. I'm not a pacemaker expert, so I don't really know
how pacemaker behaves in various corner cases. Even if I were, it would
be difficult to advise you, since you haven't even posted what version
of pacemaker / corosync / pcs you are using.
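(For future reference, the exact versions can be checked with:

pcs --version
corosync -v
pacemakerd --version

Including that output makes it much easier to help.)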
In any case, the first thing you need to do is configure stonith.
Properly configured and working stonith is required for a cluster to
operate. There is no way around it.
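As a rough sketch only (the right fence agent and its parameters depend
entirely on your hardware, the addresses and credentials below are made
up, and exact parameter names differ between fence-agents versions), an
IPMI-based device could be set up along these lines:

pcs stonith create fence-n2 fence_ipmilan pcmk_host_list=n2 \
    ip=192.0.2.12 username=admin password=secret lanplus=1
pcs property set stonith-enabled=true

'pcs stonith list' shows which fence agents are installed, and 'pcs
stonith fence <node>' lets you test a device before you rely on it.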
Regards,
Tomas
Dne 13. 07. 22 v 18:54 Piotr Szafarczyk napsal(a):
> Hi Tomas,
>
> Thank you very much for the idea. I have played with stonith_admin
> --unfence and --confirm. Whenever I try, pcs status shows my actions
> under Failed Fencing Actions. I see this in the log file:
>
> error: Unfencing of n2 by <anyone> failed: No such device
>
> No surprise here, since I have not got any devices registered.
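>
> (For reference, the fencer's view can be checked with something like:
>
> stonith_admin --list-registered
> stonith_admin --list n2
>
> which list the registered devices and the devices able to fence n2,
> respectively.)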
>
> If fencing of n2 were the cause, I would expect pcs status to show it
> as offline or unhealthy, but still show it. Instead I have got:
>
> * 2 nodes configured
>
> Also I would expect node remove + node clear + node add to make n2 a
> brand new node.
>
> Here are the relevant parts of the log from when I remove n2 from the
> cluster:
>
> No peers with id=0 and/or uname=n2 to purge from the membership cache
> Removing all n2 attributes for peer n3
> Removing all n2 attributes for peer n1
> Instructing peers to remove references to node n2/0
> Completed cib_delete operation for section status: OK
>
> There is nothing in the log file when I add it.
>
> If fencing is the cause, where should I look for what the cluster tries
> to do?
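>
> Would the fencing history, e.g.
>
> stonith_admin --history '*'
>
> be the place to start, or rather the pacemaker log?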
>
> Have you got any other suggestions on what to check?
>
> Best regards,
> Piotr
>
> On 12.07.2022 12:50, Tomas Jelinek wrote:
>> Hi Piotr,
>>
>> Based on the 'pcs cluster node add n2' and 'pcs config' outputs, pcs
>> added the node to your cluster successfully; that is, the corosync
>> config has been modified, distributed and loaded.
>>
>> It looks like the problem is with pacemaker. This is a wild guess, but
>> maybe pacemaker wants to fence n2, which is not possible, as you have
>> disabled stonith. In the meantime, n1 and n3 do not allow n2 to join
>> until it is confirmed fenced. Try looking into / posting 'pcs status
>> --full' and the pacemaker log.
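>>
>> For example (the log location may differ on your distribution):
>>
>> pcs status --full
>> grep -i -e fence -e stonith /var/log/pacemaker/pacemaker.log
>> journalctl -u pacemaker -u corosync --since today   # on systemd systems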
>>
>> With stonith disabled, you have a seemingly working cluster. That lasts
>> until an event occurs which requires working stonith for the cluster to
>> recover.
>>
>> Regards,
>> Tomas
>>
>>
>> Dne 12. 07. 22 v 12:34 Piotr Szafarczyk napsal(a):
>>> Hi,
>>>
>>> I used to have a working cluster with 3 nodes (and stonith disabled).
>>> After an unexpected restart of one node, the cluster split. Node #2
>>> started to see the others as unclean, while nodes 1 and 3 were
>>> cooperating with each other and showing #2 as offline. There were no
>>> network connection problems.
>>>
>>> I removed #2 (operating from #1) with
>>> pcs cluster node remove n2
>>>
>>> I verified that it had removed all configuration from #2, both for
>>> corosync and for pacemaker. The cluster appears to be working
>>> correctly with two nodes (and no traces of #2).
>>>
>>> Now I am trying to add the third node back.
>>> pcs cluster node add n2
>>> Disabling SBD service...
>>> n2: sbd disabled
>>> Sending 'corosync authkey', 'pacemaker authkey' to 'n2'
>>> n2: successful distribution of the file 'corosync authkey'
>>> n2: successful distribution of the file 'pacemaker authkey'
>>> Sending updated corosync.conf to nodes...
>>> n3: Succeeded
>>> n2: Succeeded
>>> n1: Succeeded
>>> n3: Corosync configuration reloaded
>>>
>>> I am able to start #2, operating from #1:
>>>
>>> pcs cluster pcsd-status
>>> n2: Online
>>> n3: Online
>>> n1: Online
>>>
>>> pcs cluster enable n2
>>> pcs cluster start n2
>>>
>>> I can see that corosync's configuration has been updated, but
>>> pacemaker's has not.
>>>
>>> _Checking from #1:_
>>>
>>> pcs config
>>> Cluster Name: n
>>> Corosync Nodes:
>>> n1 n3 n2
>>> Pacemaker Nodes:
>>> n1 n3
>>> [...]
>>>
>>> pcs status
>>> * 2 nodes configured
>>> Node List:
>>> * Online: [ n1 n3 ]
>>> [...]
>>>
>>> pcs cluster cib scope=nodes
>>> <nodes>
>>> <node id="1" uname="n1"/>
>>> <node id="3" uname="n3"/>
>>> </nodes>
>>>
>>> _#2 is seeing the state differently:_
>>>
>>> pcs config
>>> Cluster Name: n
>>> Corosync Nodes:
>>> n1 n3 n2
>>> Pacemaker Nodes:
>>> n1 n2 n3
>>>
>>> pcs status
>>> * 3 nodes configured
>>> Node List:
>>> * Online: [ n2 ]
>>> * OFFLINE: [ n1 n3 ]
>>> Full List of Resources:
>>> * No resources
>>> [...]
>>> (there are resources configured on #1 and #3)
>>>
>>> pcs cluster cib scope=nodes
>>> <nodes>
>>> <node id="1" uname="n1"/>
>>> <node id="3" uname="n3"/>
>>> <node id="2" uname="n2"/>
>>> </nodes>
>>>
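>>> I guess I could also cross-check the runtime view of the membership
>>> with something like:
>>>
>>> crm_node -l                        # nodes as pacemaker sees them
>>> corosync-cmapctl | grep nodelist   # corosync's runtime nodelist
>>>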
>>> Please help me diagnose this. Where should I look for the problem? (I
>>> have already tried a few more things: I see nothing helpful in the log
>>> files, pcs --debug shows nothing suspicious, and I even tried editing
>>> the CIB manually.)
>>>
>>> Best regards,
>>>
>>> Piotr Szafarczyk
>>>
>>>
More information about the Users mailing list