[ClusterLabs] [Problem and Question] If there are too many resources, pacemaker-controld restarts when a re-probe is executed.
renayama19661014@ybb.ne.jp
Thu May 17 16:45:34 EDT 2018
Hi All,
I have built the following environment.
* RHEL7.3 at KVM
* libqb-1.0.2
* corosync 2.4.4
* pacemaker 2.0-rc4
I start the cluster and load a crm file containing 180 Dummy resources.
A third node is not started.
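For reference, the test configuration can be generated with a loop like the one below (a simplified sketch; the group definition shown in crm_mon is omitted, and the monitor interval is just an example):
--------------
# Hypothetical generator for 180 ocf:pacemaker:Dummy primitives.
for i in $(seq 1 180); do
    echo "primitive prmDummy$i ocf:pacemaker:Dummy op monitor interval=10s"
done > dummy180.crm
# Load it into the running cluster with crmsh:
#   crm configure load update dummy180.crm
--------------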
--------------
[root@rh73-01 ~]# crm_mon -1
Stack: corosync
Current DC: rh73-01 (version 2.0.0-3aa2fced22) - partition with quorum
Last updated: Thu May 17 18:44:39 2018
Last change: Thu May 17 18:44:18 2018 by root via cibadmin on rh73-01
2 nodes configured
180 resources configured
Online: [ rh73-01 rh73-02 ]
Active resources:
Resource Group: grpJOS1
prmDummy1 (ocf::pacemaker:Dummy): Started rh73-01
(snip)
prmDummy140 (ocf::pacemaker:Dummy): Started rh73-01
(snip)
prmDummy160 (ocf::pacemaker:Dummy): Started rh73-02
--------------
I execute crm_resource -R after about 120 resources have started on the cluster.
--------------
[root@rh73-01 ~]# crm_resource -R
Waiting for 1 replies from the controller. OK
--------------
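Note that crm_resource -R without -r re-probes every resource on every node, so with 180 resources on 2 nodes this triggers roughly 360 probe operations, and the matching result updates flow back to the CIB almost at once. A lighter-weight alternative (a sketch, not what I actually ran) is to refresh one resource at a time:
--------------
# Re-probe a single resource instead of the whole cluster:
crm_resource --refresh --resource prmDummy1
--------------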
I tried the following 3 patterns.
*******************
Pattern 1) With /etc/sysconfig/pacemaker set as follows:
--------------@/etc/sysconfig/pacemaker
PCMK_logfacility=local1
PCMK_logpriority=info
--------------
After a while, crmd on the DC node fails and is restarted; in the ps output below, pacemaker-controld has a later start time (18:50) than the other daemons (18:43).
[root@rh73-01 ~]# ps -ef |grep pace
root 6751 1 0 18:43 ? 00:00:00 /usr/sbin/pacemakerd -f
haclust+ 6752 6751 2 18:43 ? 00:00:16 /usr/libexec/pacemaker/pacemaker-based
root 6753 6751 0 18:43 ? 00:00:01 /usr/libexec/pacemaker/pacemaker-fenced
root 6754 6751 0 18:43 ? 00:00:02 /usr/libexec/pacemaker/pacemaker-execd
haclust+ 6755 6751 0 18:43 ? 00:00:00 /usr/libexec/pacemaker/pacemaker-attrd
haclust+ 6756 6751 0 18:43 ? 00:00:00 /usr/libexec/pacemaker/pacemaker-schedulerd
haclust+ 20478 6751 0 18:50 ? 00:00:00 /usr/libexec/pacemaker/pacemaker-controld
root 25552 1302 0 18:52 pts/0 00:00:00 grep --color=auto pace
Pattern 2) To work around the problem, I made the following settings.
--------------@/etc/sysconfig/pacemaker
PCMK_logfacility=local1
PCMK_logpriority=info
PCMK_cib_timeout=120
PCMK_ipc_buffer=262144
--------------@crm file
(snip)
property cib-bootstrap-options: \
    cluster-ipc-limit=2000 \
(snip)
--------------
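To confirm that the property was accepted, crm_attribute can query it back from the CIB (the exact output format may differ by version):
--------------
# Query cluster-ipc-limit from the crm_config section:
crm_attribute --type crm_config --name cluster-ipc-limit --query
--------------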
Just like pattern 1, after a while, crmd on the DC node fails and is restarted (pacemaker-controld starts at 19:00 versus 18:57 for the other daemons).
[root@rh73-01 ~]# ps -ef | grep pace
root 3840 1 0 18:57 ? 00:00:00 /usr/sbin/pacemakerd -f
haclust+ 3841 3840 3 18:57 ? 00:00:16 /usr/libexec/pacemaker/pacemaker-based
root 3842 3840 0 18:57 ? 00:00:01 /usr/libexec/pacemaker/pacemaker-fenced
root 3843 3840 0 18:57 ? 00:00:01 /usr/libexec/pacemaker/pacemaker-execd
haclust+ 3844 3840 0 18:57 ? 00:00:00 /usr/libexec/pacemaker/pacemaker-attrd
haclust+ 3845 3840 0 18:57 ? 00:00:00 /usr/libexec/pacemaker/pacemaker-schedulerd
haclust+ 6221 3840 0 19:00 ? 00:00:00 /usr/libexec/pacemaker/pacemaker-controld
root 17974 1302 0 19:05 pts/0 00:00:00 grep --color=auto pace
Pattern 3) To work around the problem, I made the following settings, this time making only PCMK_ipc_buffer smaller than the default.
--------------@/etc/sysconfig/pacemaker
PCMK_logfacility=local1
PCMK_logpriority=info
PCMK_ipc_buffer=20480
--------------
Even after a while, crmd does not restart, and the cluster's resources are all configured as expected.
[root@rh73-01 ~]# ps -ef | grep pace
root 23511 1 0 19:08 ? 00:00:00 /usr/sbin/pacemakerd -f
haclust+ 23512 23511 16 19:08 ? 00:00:19 /usr/libexec/pacemaker/pacemaker-based
root 23513 23511 0 19:08 ? 00:00:01 /usr/libexec/pacemaker/pacemaker-fenced
root 23514 23511 0 19:08 ? 00:00:00 /usr/libexec/pacemaker/pacemaker-execd
haclust+ 23515 23511 0 19:08 ? 00:00:00 /usr/libexec/pacemaker/pacemaker-attrd
haclust+ 23516 23511 3 19:08 ? 00:00:04 /usr/libexec/pacemaker/pacemaker-schedulerd
haclust+ 23517 23511 11 19:08 ? 00:00:13 /usr/libexec/pacemaker/pacemaker-controld
root 28430 1302 0 19:10 pts/0 00:00:00 grep --color=auto pace
*******************
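For comparison, here are the three buffer settings side by side (assuming the compiled-in default of PCMK_ipc_buffer is 131072 bytes):
--------------
# Pattern 1: PCMK_ipc_buffer unset   (default 131072) -> crmd restarts
# Pattern 2: PCMK_ipc_buffer=262144  (2x default)     -> crmd restarts
# Pattern 3: PCMK_ipc_buffer=20480   (~1/6 default)   -> no restart
--------------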
This problem also seems to occur with Pacemaker 1.1.18. With PCMK_fail_fast=yes, this crmd restart causes the node to reboot.
When PCMK_ipc_buffer is made small, crmd does not restart. When it is made larger, crmd does restart, so something may be wrong with Pacemaker.
Isn't there something wrong with Pacemaker?
When the number of resources is large, what settings are appropriate?
* This issue is registered in the following Bugzilla:
- https://bugs.clusterlabs.org/show_bug.cgi?id=5349
Best Regards,
Hideo Yamauchi.