[ClusterLabs] 32 nodes pacemaker cluster setup issue

Vladislav Bogdanov bubble at hoster-ok.com
Wed May 19 07:17:33 EDT 2021


Hi.

Have you considered using pacemaker-remote instead?
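
A very rough sketch of the conversion (hostnames node1/node20 here are just
examples for a full node and a node being demoted; adjust for your setup,
RPM-based system assumed):

    # on the node to be converted: install the remote daemon, copy the
    # cluster authkey from a full node, then start pacemaker_remote
    yum install -y pacemaker-remote
    scp node1:/etc/pacemaker/authkey /etc/pacemaker/authkey
    systemctl enable --now pacemaker_remote

    # on one of the remaining full nodes: drop the node from corosync
    # membership and re-add it as a remote-node resource
    pcs cluster node remove node20
    pcs resource create node20 ocf:pacemaker:remote server=node20

With, say, 3-5 full corosync nodes and the rest as remote nodes, the totem
membership stays small while pacemaker still manages resources everywhere.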


On May 18, 2021 5:55:57 PM S Sathish S <s.s.sathish at ericsson.com> wrote:
> Hi Team,
>
> We have set up a 32-node pacemaker cluster in which each node runs 10 
> resources, so in total around 300+ components are up and running. While 
> performing installation/updates we carry out the tasks below.
>
> From the first node we add all 31 other nodes into the cluster one by one 
> and create the resources for each node.
> In some use cases we execute pcs resource stop/start commands in parallel 
> across all nodes.
> For any network-related change on a node we put the cluster into 
> maintenance mode, and after the change we disable maintenance mode again 
> (commands sketched after this list).
> In some cases we also reboot the nodes one by one so that 
> kernel/application changes take effect.
>
> Up to a 9-node cluster this works fine for us and we do not see the issue 
> reported below. On the 32-node setup we hit the errors below whenever an 
> installation/upgrade runs the tasks above.
>
> Please find the corosync logs from the problematic period, with the 
> following error messages:
>
> May 17 08:08:47 [1978] node1  corosync notice  [TOTEM ] A new membership 
> (10.61.78.50:85864) was formed. Members left: 2 16 17 31 15 12 13 14 27 28 
> 29 30 20 32 18 7 22 19 24 25 10 5 6 26 23 21 11 3 4
> May 17 08:08:47 [1978] node1  corosync notice  [TOTEM ] Failed to receive 
> the leave message. failed: 2 16 17 31 15 12 13 14 27 28 29 30 20 32 18 7 22 
> 19 24 25 10 5 6 26 23 21 11 3 4
> May 17 08:08:47 [1978] node1  corosync notice  [QUORUM] This node is within 
> the non-primary component and will NOT provide any services.
> May 17 08:08:47 [1978] node1  corosync notice  [QUORUM] Members[1]: 1
> May 17 08:08:47 [1978] node1  corosync notice  [MAIN  ] Completed service 
> synchronization, ready to provide service.
> May 17 11:17:30 [1866] node1  corosync notice  [MAIN  ] Corosync Cluster 
> Engine ('UNKNOWN'): started and ready to provide service.
> May 17 11:17:30 [1866] node1   corosync info    [MAIN  ] Corosync built-in 
> features: pie relro bindnow
> May 17 11:17:30 [1866] node1   corosync warning [MAIN  ] Could not set 
> SCHED_RR at priority 99: Operation not permitted (1)
> May 17 11:17:30 [1866] node1   corosync notice  [TOTEM ] Initializing 
> transport (UDP/IP Unicast).
> May 17 11:17:30 [1866] node1  corosync notice  [TOTEM ] Initializing 
> transmit/receive security (NSS) crypto: none hash: none
> May 17 11:17:30 [1866] node1   corosync notice  [TOTEM ] The network 
> interface [10.61.78.50] is now up.
> May 17 11:17:30 [1866] node1   corosync notice  [SERV  ] Service engine 
> loaded: corosync configuration map access [0]
> May 17 11:17:30 [1866] node1   corosync info    [QB    ] server name: cmap
> May 17 11:17:30 [1866] node1   corosync notice  [SERV  ] Service engine 
> loaded: corosync configuration service [1]
> May 17 11:17:30 [1866] node1   corosync info    [QB    ] server name: cfg
> May 17 11:17:30 [1866] node1   corosync notice  [SERV  ] Service engine 
> loaded: corosync cluster closed process group service v1.01 [2]
> May 17 11:17:30 [1866] node1   corosync info    [QB    ] server name: cpg
> May 17 11:17:30 [1866] node1   corosync notice  [SERV  ] Service engine 
> loaded: corosync profile loading service [4]
> May 17 11:17:30 [1866] node1   corosync notice  [QUORUM] Using quorum 
> provider corosync_votequorum
> May 17 11:17:30 [1866] node1   corosync notice  [SERV  ] Service engine 
> loaded: corosync vote quorum service v1.0 [5]
> May 17 11:17:30 [1866] node1  corosync info    [QB    ] server name: votequorum
> May 17 11:17:30 [1866] node1  corosync notice  [SERV  ] Service engine 
> loaded: corosync cluster quorum service v0.1 [3]
> May 17 11:17:30 [1866] node1  corosync info    [QB    ] server name: quorum
>
> Logs from another node:
> May 18 16:20:17 [1968] node2 corosync notice  [TOTEM ] A new membership 
> (10.223.106.11:104056) was formed. Members left: 2 16 17 31 15 12 1 13 14 
> 27 28 29 30 20 7 22 8 9 19 24 25 10 5 6 26 23 11 3 4
> May 18 16:20:17 [1968] node2 corosync notice  [TOTEM ] Failed to receive 
> the leave message. failed: 2 16 17 31 15 12 1 13 14 27 28 29 30 20 7 22 8 9 
> 19 24 25 10 5 6 26 23 11 3 4
> May 18 16:20:17 [1968] node2 corosync notice  [QUORUM] This node is within 
> the non-primary component and will NOT provide any services.
> May 18 16:20:17 [1968] node2 corosync notice  [QUORUM] Members[1]: 32
> May 18 16:20:17 [1968] node2 corosync notice  [MAIN  ] Completed service 
> synchronization, ready to provide service.
> May 18 16:22:20 [1968] node2 corosync notice  [TOTEM ] A new membership 
> (10.217.41.26:104104) was formed. Members joined: 27 29 18
> May 18 16:22:20 [1968] node2 corosync notice  [QUORUM] Members[4]: 27 29 32 18
> May 18 16:22:20 [1968] node2 corosync notice  [MAIN  ] Completed service 
> synchronization, ready to provide service.
> May 18 16:22:45 [1968] node2 corosync notice  [TOTEM ] A new membership 
> (10.217.41.26:104112) was formed. Members
> May 18 16:22:45 [1968] node2 corosync notice  [QUORUM] Members[4]: 27 29 32 18
> May 18 16:22:45 [1968] node2 corosync notice  [MAIN  ] Completed service 
> synchronization, ready to provide service.
> May 18 16:22:46 [1968] node2 corosync notice  [TOTEM ] A new membership 
> (10.217.41.26:104116) was formed. Members joined: 30
> May 18 16:22:46 [1968] node2 corosync notice  [QUORUM] Members[5]: 27 29 30 
> 32 18
>
> Any pcs command fails with the error message below on all nodes:
> [root at node1 online]# pcs property set maintenance-mode=false --wait=240
> Error: Unable to update cib
> Call cib_replace failed (-62): Timer expired
> [root at node1 online]#
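>
> (For completeness: the membership each node currently sees can be checked 
> with the generic corosync tools, e.g. corosync-quorumtool -s and 
> corosync-cfgtool -s; output omitted here.)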
>
> Workaround: we power off all nodes and bring them back one by one to 
> recover from the problem above. Kindly look into this error message and 
> provide us with an RCA for this problem.
>
>
> Current component versions:
> pacemaker-2.0.2 -->
> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2
> corosync-2.4.4 -->
> https://github.com/corosync/corosync/tree/v2.4.4
> pcs-0.9.169
>
> Thanks and Regards,
> S Sathish S
