[ClusterLabs] Network packet loss can cause Pacemaker to exit abnormally
Ken Gaillot
kgaillot at redhat.com
Mon Aug 29 14:09:22 UTC 2016
On 08/27/2016 09:15 PM, chenhj wrote:
> Hi all,
>
> When I use the following command to simulate network packet loss on one
> member of my 3-node Pacemaker+Corosync cluster, it sometimes causes
> Pacemaker on another node to exit.
>
> tc qdisc add dev eth2 root netem loss 90%
>
> Is there any way to avoid this problem?
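[Editor's note: for reference, a minimal sketch of undoing that simulation
afterwards, assuming the netem rule above was the only qdisc change made
to eth2:

    tc qdisc del dev eth2 root netem
]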
>
> [root@node3 ~]# ps -ef|grep pacemaker
> root     32540     1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/lrmd
> 189      32542     1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/pengine
> root     33491 11491  0 00:58 pts/1    00:00:00 grep pacemaker
>
> /var/log/cluster/corosync.log
> ------------------------------------------------
> Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
> Completed cib_modify operation for section status: OK (rc=0,
> origin=local/attrd/230, version=10.657.19)
> Aug 27 12:33:59 corosync [CPG ] chosen downlist: sender r(0)
> ip(192.168.125.129) ; members(old:2 left:1)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
> Node 2172496064 joined group pacemakerd (counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
> Node 2172496064 still member of group pacemakerd (peer=node2,
> counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> crm_update_peer_proc: pcmk_cpg_membership: Node node2[2172496064]
> - corosync-cpg is now online
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
> Node 2273159360 still member of group pacemakerd (peer=node3,
> counter=12.1)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_cs_flush:
> Sent 0 CPG messages (1 remaining, last=19): Try again (6)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
> Node 2273159360 left group pacemakerd (peer=node3, counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> crm_update_peer_proc: pcmk_cpg_membership: Node node3[2273159360]
> - corosync-cpg is now offline
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
> Node 2172496064 still member of group pacemakerd (peer=node2,
> counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_membership:
> We're not part of CPG group 'pacemakerd' anymore!
> Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_dispatch:
> Evicted from CPG membership
From the above, I suspect that the node with the network loss was the
DC, and from its point of view, it was the other node that went away.
Proper quorum and fencing configuration should prevent this from being
an issue. Once the one node sees heavy network loss, the other node(s)
should fence it before it causes too many problems.
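As an illustration, a minimal sketch of the settings involved, assuming
the pcs CLI is installed and an IPMI-style fence device is available (the
device name, address, and credentials below are placeholders, not taken
from this thread):

    # enable fencing cluster-wide and stop resources when quorum is lost
    pcs property set stonith-enabled=true
    pcs property set no-quorum-policy=stop

    # register a fence device that can power off node3 (parameters illustrative)
    pcs stonith create fence-node3 fence_ipmilan ipaddr=192.0.2.13 \
        login=admin passwd=secret pcmk_host_list=node3

The same properties can also be set with crm_attribute if pcs is not
available on your nodes.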
> Aug 27 12:33:59 [46849] node3 pacemakerd: error: mcp_cpg_destroy:
> Connection destroyed
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_xml_cleanup:
> Cleaning up memory from libxml2
> Aug 27 12:33:59 [46858] node3 attrd: error: crm_ipc_read:
> Connection to pacemakerd failed
> Aug 27 12:33:59 [46858] node3 attrd: error:
> mainloop_gio_callback: Connection to pacemakerd[0x1255eb0] closed
> (I/O condition=17)
> Aug 27 12:33:59 [46858] node3 attrd: crit: attrd_cs_destroy:
> Lost connection to Corosync service!
> Aug 27 12:33:59 [46858] node3 attrd: notice: main: Exiting...
> Aug 27 12:33:59 [46858] node3 attrd: notice: main:
> Disconnecting client 0x12579a0, pid=46860...
> Aug 27 12:33:59 [46858] node3 attrd: error:
> attrd_cib_connection_destroy: Connection to the CIB terminated...
> Aug 27 12:33:59 corosync [pcmk ] info: pcmk_ipc_exit: Client attrd
> (conn=0x1955f80, async-conn=0x1955f80) left
> Aug 27 12:33:59 [46856] node3 stonith-ng: error: crm_ipc_read:
> Connection to pacemakerd failed
> Aug 27 12:33:59 [46856] node3 stonith-ng: error:
> mainloop_gio_callback: Connection to pacemakerd[0x2314af0] closed
> (I/O condition=17)
> Aug 27 12:33:59 [46856] node3 stonith-ng: error:
> stonith_peer_cs_destroy: Corosync connection terminated
> Aug 27 12:33:59 [46856] node3 stonith-ng: info: stonith_shutdown:
> Terminating with 1 clients
> Aug 27 12:33:59 [46856] node3 stonith-ng: info:
> cib_connection_destroy: Connection to the CIB closed.
> ...
>
> Please see corosynclog.txt for the detailed log.
>
>
> [root@node3 ~]# cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     secauth: off
>     interface {
>         member {
>             memberaddr: 192.168.125.134
>         }
>         member {
>             memberaddr: 192.168.125.129
>         }
>         member {
>             memberaddr: 192.168.125.135
>         }
>
>         ringnumber: 0
>         bindnetaddr: 192.168.125.135
>         mcastport: 5405
>         ttl: 1
>     }
>     transport: udpu
> }
>
> logging {
>     fileline: off
>     to_logfile: yes
>     to_syslog: no
>     logfile: /var/log/cluster/corosync.log
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>     }
> }
>
> service {
>     ver: 1
>     name: pacemaker
> }
>
> Environment:
> [root@node3 ~]# rpm -q corosync
> corosync-1.4.1-7.el6.x86_64
> [root@node3 ~]# cat /etc/redhat-release
> CentOS release 6.3 (Final)
> [root@node3 ~]# pacemakerd -F
> Pacemaker 1.1.14-1.el6 (Build: 70404b0)
> Supporting v3.0.10: generated-manpages agent-manpages ascii-docs
> ncurses libqb-logging libqb-ipc nagios corosync-plugin cman acls
>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>