[ClusterLabs] 32 nodes pacemaker cluster setup issue

Digimer <lists@alteeve.ca>
Tue May 18 11:33:24 EDT 2021


On 2021-05-18 10:49 a.m., S Sathish S wrote:
> Hi Team,
> 
>  
> 
> We have set up a 32-node Pacemaker cluster; each node runs 10 resources,
> so around 320 resources in total are up and running. While performing
> an installation/update, the tasks below are carried out.
> 
>  
> 
>   * From the first node, we add the other 31 nodes one-by-one into the
>     cluster and then add the resources for each node (example commands
>     are sketched below).
>   * In some use-cases we run pcs resource stop/start commands in
>     parallel for all nodes.
>   * For any network-related change on a node, we put the cluster into
>     maintenance mode and disable maintenance mode once the network
>     change is done.
>   * In some cases we also reboot the nodes one-by-one so that
>     kernel/application changes take effect.
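> 
> For example, the per-node steps look roughly like this (the resource
> names and agents below are placeholders, not our actual configuration):
> 
>     # from the first node, for each of the other 31 nodes:
>     pcs cluster node add <nodeN>
>     pcs resource create <resource-name> <resource-agent> ...
> 
>     # parallel stop/start during some use-cases:
>     pcs resource disable <resource-name>
>     pcs resource enable <resource-name>
> 
>     # around a network change:
>     pcs property set maintenance-mode=true
>     # ... apply the network change ...
>     pcs property set maintenance-mode=false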
> 
>  
> 
> Up to a 9-node cluster everything works fine for us and we do not see
> the issue reported below. With the 32-node cluster setup we hit the
> errors below whenever an installation/upgrade runs the tasks above.
> 
>  
> 
> *Corosync logs from the problematic period, with the following error
> messages*:
> 
> May 17 08:08:47 [1978] node1  corosync notice  [TOTEM ] A new membership
> (10.61.78.50:85864) was formed. Members left: 2 16 17 31 15 12 13 14 27
> 28 29 30 20 32 18 7 22 19 24 25 10 5 6 26 23 21 11 3 4
> 
> May 17 08:08:47 [1978] node1  corosync notice  [TOTEM ] Failed to
> receive the leave message. failed: 2 16 17 31 15 12 13 14 27 28 29 30 20
> 32 18 7 22 19 24 25 10 5 6 26 23 21 11 3 4
> 
> May 17 08:08:47 [1978] node1  corosync notice  [QUORUM] This node is
> within the non-primary component and will NOT provide any services.
> 
> May 17 08:08:47 [1978] node1  corosync notice  [QUORUM] Members[1]: 1
> 
> May 17 08:08:47 [1978] node1  corosync notice  [MAIN  ] Completed
> service synchronization, ready to provide service.
> 
> May 17 11:17:30 [1866] node1  corosync notice  [MAIN  ] Corosync Cluster
> Engine ('UNKNOWN'): started and ready to provide service.
> 
> May 17 11:17:30 [1866] node1   corosync info    [MAIN  ] Corosync
> built-in features: pie relro bindnow
> 
> May 17 11:17:30 [1866] node1   corosync warning [MAIN  ] Could not set
> SCHED_RR at priority 99: Operation not permitted (1)
> 
> May 17 11:17:30 [1866] node1   corosync notice  [TOTEM ] Initializing
> transport (UDP/IP Unicast).
> 
> May 17 11:17:30 [1866] node1  corosync notice  [TOTEM ] Initializing
> transmit/receive security (NSS) crypto: none hash: none
> 
> May 17 11:17:30 [1866] node1   corosync notice  [TOTEM ] The network
> interface [10.61.78.50] is now up.
> 
> May 17 11:17:30 [1866] node1   corosync notice  [SERV  ] Service engine
> loaded: corosync configuration map access [0]
> 
> May 17 11:17:30 [1866] node1   corosync info    [QB    ] server name: cmap
> 
> May 17 11:17:30 [1866] node1   corosync notice  [SERV  ] Service engine
> loaded: corosync configuration service [1]
> 
> May 17 11:17:30 [1866] node1   corosync info    [QB    ] server name: cfg
> 
> May 17 11:17:30 [1866] node1   corosync notice  [SERV  ] Service engine
> loaded: corosync cluster closed process group service v1.01 [2]
> 
> May 17 11:17:30 [1866] node1   corosync info    [QB    ] server name: cpg
> 
> May 17 11:17:30 [1866] node1   corosync notice  [SERV  ] Service engine
> loaded: corosync profile loading service [4]
> 
> May 17 11:17:30 [1866] node1   corosync notice  [QUORUM] Using quorum
> provider corosync_votequorum
> 
> May 17 11:17:30 [1866] node1   corosync notice  [SERV  ] Service engine
> loaded: corosync vote quorum service v1.0 [5]
> 
> May 17 11:17:30 [1866] node1  corosync info    [QB    ] server name:
> votequorum
> 
> May 17 11:17:30 [1866] node1  corosync notice  [SERV  ] Service engine
> loaded: corosync cluster quorum service v0.1 [3]
> 
> May 17 11:17:30 [1866] node1  corosync info    [QB    ] server name: quorum
> 
>  
> 
> Logs from another node:
> 
> May 18 16:20:17 [1968] node2 corosync notice  [TOTEM ] A new membership
> (10.223.106.11:104056) was formed. Members left: 2 16 17 31 15 12 1 13
> 14 27 28 29 30 20 7 22 8 9 19 24 25 10 5 6 26 23 11 3 4
> 
> May 18 16:20:17 [1968] node2 corosync notice  [TOTEM ] Failed to receive
> the leave message. failed: 2 16 17 31 15 12 1 13 14 27 28 29 30 20 7 22
> 8 9 19 24 25 10 5 6 26 23 11 3 4
> 
> May 18 16:20:17 [1968] node2 corosync notice  [QUORUM] This node is
> within the non-primary component and will NOT provide any services.
> 
> May 18 16:20:17 [1968] node2 corosync notice  [QUORUM] Members[1]: 32
> 
> May 18 16:20:17 [1968] node2 corosync notice  [MAIN  ] Completed service
> synchronization, ready to provide service.
> 
> May 18 16:22:20 [1968] node2 corosync notice  [TOTEM ] A new membership
> (10.217.41.26:104104) was formed. Members joined: 27 29 18
> 
> May 18 16:22:20 [1968] node2 corosync notice  [QUORUM] Members[4]: 27 29
> 32 18
> 
> May 18 16:22:20 [1968] node2 corosync notice  [MAIN  ] Completed service
> synchronization, ready to provide service.
> 
> May 18 16:22:45 [1968] node2 corosync notice  [TOTEM ] A new membership
> (10.217.41.26:104112) was formed. Members
> 
> May 18 16:22:45 [1968] node2 corosync notice  [QUORUM] Members[4]: 27 29
> 32 18
> 
> May 18 16:22:45 [1968] node2 corosync notice  [MAIN  ] Completed service
> synchronization, ready to provide service.
> 
> May 18 16:22:46 [1968] node2 corosync notice  [TOTEM ] A new membership
> (10.217.41.26:104116) was formed. Members joined: 30
> 
> May 18 16:22:46 [1968] node2 corosync notice  [QUORUM] Members[5]: 27 29
> 30 32 18
> 
>  
> 
> *Any pcs command then fails with this error message, on all nodes:*
> 
> [root@node1 online]# pcs property set maintenance-mode=false --wait=240
> Error: Unable to update cib
> Call cib_replace failed (-62): Timer expired
> 
> [root@node1 online]#
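> 
> For reference, corosync's own view of the membership can still be
> queried locally on each node while pcs is timing out, e.g.:
> 
>     [root@node1 online]# corosync-quorumtool -s
> 
> which reports the quorum state and the member list that the node
> currently sees.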
> 
>  
> 
> *Workaround*: we power off all nodes and bring them back one-by-one to
> recover from the problem. Kindly look into this error message and
> provide us with an RCA for this problem.
> 
>  
> 
>  
> 
> *Current Pacemaker version* :
> 
> pacemaker-2.0.2 -->
> https://github.com/ClusterLabs/pacemaker/tree/Pacemaker-2.0.2
> 
> corosync-2.4.4 -->  https://github.com/corosync/corosync/tree/v2.4.4
> 
> pcs-0.9.169
> 
>  
> 
> Thanks and Regards,
> 
> S Sathish S

As I understand it, clusters over 16 nodes are generally discouraged.
When you do build large clusters, the time needed to sync the CIB (and
handle other messaging) can become too lengthy.

Have you played with the corosync / totem timing values? You may need to
increase them. Are you using unicast or multicast? What is the CPU load
like on the nodes when these issues arise?
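
If you do tune them, the relevant knobs live in the totem section of
corosync.conf. A rough sketch (the values are illustrative only and need
testing on your network; corosync wants consensus to be at least
1.2 * token):

    totem {
        # how long to wait for the token before declaring a membership
        # change; the 1000ms default is tight for 32 nodes
        token: 5000
        # extra time added to token for every node beyond the first
        # two (650 is the corosync 2.x default)
        token_coefficient: 650
        # how long to wait for consensus before starting a new
        # membership round; must be at least 1.2 * token
        consensus: 6000
    }

With 32 nodes the effective token timeout is token plus 30 *
token_coefficient, so the point is to give the cluster more headroom,
not these exact numbers.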

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

