[ClusterLabs] Network packet loss causes Pacemaker to exit abnormally

chenhj chjischj at 163.com
Sat Aug 27 22:15:52 EDT 2016


Hi all,


When I use the following command to simulate packet loss on the network interface of one member of my 3-node Pacemaker+Corosync cluster,
it sometimes causes Pacemaker on another node to exit.


  tc qdisc add dev eth2 root netem loss 90%
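
(For reference, the netem rule can be removed again and the currently active qdisc checked with the commands below; eth2 is simply the interface carrying the cluster traffic in my setup:)

  tc qdisc del dev eth2 root      # remove the netem rule again
  tc qdisc show dev eth2          # show the qdisc currently active on the interface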


Is there any way to avoid this problem?


[root@node3 ~]# ps -ef|grep pacemaker
root      32540      1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/lrmd
189       32542      1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/pengine
root      33491  11491  0 00:58 pts/1    00:00:00 grep pacemaker


/var/log/cluster/corosync.log 
------------------------------------------------
Aug 27 12:33:59 [46855] node3        cib:     info: cib_process_request:        Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/230, version=10.657.19)
Aug 27 12:33:59 corosync [CPG   ] chosen downlist: sender r(0) ip(192.168.125.129) ; members(old:2 left:1)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:        Node 2172496064 joined group pacemakerd (counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:        Node 2172496064 still member of group pacemakerd (peer=node2, counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_update_peer_proc:       pcmk_cpg_membership: Node node2[2172496064] - corosync-cpg is now online
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:        Node 2273159360 still member of group pacemakerd (peer=node3, counter=12.1)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_cs_flush:       Sent 0 CPG messages  (1 remaining, last=19): Try again (6)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:        Node 2273159360 left group pacemakerd (peer=node3, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_update_peer_proc:       pcmk_cpg_membership: Node node3[2273159360] - corosync-cpg is now offline
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: pcmk_cpg_membership:        Node 2172496064 still member of group pacemakerd (peer=node2, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd:    error: pcmk_cpg_membership:        We're not part of CPG group 'pacemakerd' anymore!
Aug 27 12:33:59 [46849] node3 pacemakerd:    error: pcmk_cpg_dispatch:  Evicted from CPG membership
Aug 27 12:33:59 [46849] node3 pacemakerd:    error: mcp_cpg_destroy:    Connection destroyed
Aug 27 12:33:59 [46849] node3 pacemakerd:     info: crm_xml_cleanup:    Cleaning up memory from libxml2
Aug 27 12:33:59 [46858] node3      attrd:    error: crm_ipc_read:       Connection to pacemakerd failed
Aug 27 12:33:59 [46858] node3      attrd:    error: mainloop_gio_callback:      Connection to pacemakerd[0x1255eb0] closed (I/O condition=17)
Aug 27 12:33:59 [46858] node3      attrd:     crit: attrd_cs_destroy:   Lost connection to Corosync service!
Aug 27 12:33:59 [46858] node3      attrd:   notice: main:       Exiting...
Aug 27 12:33:59 [46858] node3      attrd:   notice: main:       Disconnecting client 0x12579a0, pid=46860...
Aug 27 12:33:59 [46858] node3      attrd:    error: attrd_cib_connection_destroy:       Connection to the CIB terminated...
Aug 27 12:33:59 corosync [pcmk  ] info: pcmk_ipc_exit: Client attrd (conn=0x1955f80, async-conn=0x1955f80) left
Aug 27 12:33:59 [46856] node3 stonith-ng:    error: crm_ipc_read:       Connection to pacemakerd failed
Aug 27 12:33:59 [46856] node3 stonith-ng:    error: mainloop_gio_callback:      Connection to pacemakerd[0x2314af0] closed (I/O condition=17)
Aug 27 12:33:59 [46856] node3 stonith-ng:    error: stonith_peer_cs_destroy:    Corosync connection terminated
Aug 27 12:33:59 [46856] node3 stonith-ng:     info: stonith_shutdown:   Terminating with  1 clients
Aug 27 12:33:59 [46856] node3 stonith-ng:     info: cib_connection_destroy:     Connection to the CIB closed.
...


Please see the attached corosynclog.txt for the full log.




[root@node3 ~]# cat /etc/corosync/corosync.conf
totem {
       version: 2
       secauth: off
       interface {
               member {
                       memberaddr: 192.168.125.134
               }
               member {
                       memberaddr: 192.168.125.129
               }
               member {
                       memberaddr: 192.168.125.135
               }


               ringnumber: 0
               bindnetaddr: 192.168.125.135
               mcastport: 5405
               ttl: 1
       }
       transport: udpu
}


logging {
       fileline: off
       to_logfile: yes
       to_syslog: no
       logfile: /var/log/cluster/corosync.log
       debug: off
       timestamp: on
       logger_subsys {
               subsys: AMF
               debug: off
       }
}


service {
       ver: 1
       name: pacemaker
}
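
While reproducing this, the ring status and the cluster view can be checked on each node with, for example, the following (the exact output differs between versions):

  corosync-cfgtool -s    # ring status of the local corosync
  crm_mon -1             # one-shot view of the Pacemaker cluster state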


Environment:
[root@node3 ~]# rpm -q corosync
corosync-1.4.1-7.el6.x86_64
[root@node3 ~]# cat /etc/redhat-release 
CentOS release 6.3 (Final)
[root@node3 ~]# pacemakerd -F
Pacemaker 1.1.14-1.el6 (Build: 70404b0)
 Supporting v3.0.10:  generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc nagios  corosync-plugin cman acls
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: corosynclog.txt
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20160828/87adda1b/attachment-0002.txt>

