[ClusterLabs] Network packet loss can cause Pacemaker to exit abnormally
Klaus Wenninger
kwenning at redhat.com
Mon Aug 29 08:37:19 UTC 2016
On 08/28/2016 04:15 AM, chenhj wrote:
> Hi all,
>
> When I use the following command to simulate network packet loss on
> one member of my 3-node Pacemaker+Corosync cluster,
> it sometimes causes Pacemaker on another node to exit.
>
> tc qdisc add dev eth2 root netem loss 90%
>
> Is there any method to avoid this problem?
>
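A quick aside on the reproducer: a netem qdisc stays in place until it is
deleted, so make sure the impairment is removed between test runs and that
you can still reach the node out-of-band. The standard tc commands for
that are:

    tc qdisc add dev eth2 root netem loss 90%   # inject 90% packet loss
    tc qdisc show dev eth2                      # verify the rule is active
    tc qdisc del dev eth2 root                  # remove the impairment again
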
> [root@node3 ~]# ps -ef|grep pacemaker
> root 32540 1 0 00:57 ? 00:00:00
> /usr/libexec/pacemaker/lrmd
> 189 32542 1 0 00:57 ? 00:00:00
> /usr/libexec/pacemaker/pengine
> root 33491 11491 0 00:58 pts/1 00:00:00 grep pacemaker
>
> /var/log/cluster/corosync.log
> ------------------------------------------------
> Aug 27 12:33:59 [46855] node3 cib: info:
> cib_process_request: Completed cib_modify operation for section
> status: OK (rc=0, origin=local/attrd/230, version=10.657.19)
> Aug 27 12:33:59 corosync [CPG ] chosen downlist: sender r(0)
> ip(192.168.125.129) ; members(old:2 left:1)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> pcmk_cpg_membership: Node 2172496064 joined group pacemakerd
> (counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> pcmk_cpg_membership: Node 2172496064 still member of group
> pacemakerd (peer=node2, counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> crm_update_peer_proc: pcmk_cpg_membership: Node
> node2[2172496064] - corosync-cpg is now online
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> pcmk_cpg_membership: Node 2273159360 still member of group
> pacemakerd (peer=node3, counter=12.1)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_cs_flush:
> Sent 0 CPG messages (1 remaining, last=19): Try again (6)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> pcmk_cpg_membership: Node 2273159360 left group pacemakerd
> (peer=node3, counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> crm_update_peer_proc: pcmk_cpg_membership: Node
> node3[2273159360] - corosync-cpg is now offline
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> pcmk_cpg_membership: Node 2172496064 still member of group
> pacemakerd (peer=node2, counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: error:
> pcmk_cpg_membership: We're not part of CPG group 'pacemakerd'
> anymore!
> Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_dispatch:
> Evicted from CPG membership
> Aug 27 12:33:59 [46849] node3 pacemakerd: error: mcp_cpg_destroy:
> Connection destroyed
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_xml_cleanup:
> Cleaning up memory from libxml2
> Aug 27 12:33:59 [46858] node3 attrd: error: crm_ipc_read:
> Connection to pacemakerd failed
> Aug 27 12:33:59 [46858] node3 attrd: error:
> mainloop_gio_callback: Connection to pacemakerd[0x1255eb0] closed
> (I/O condition=17)
> Aug 27 12:33:59 [46858] node3 attrd: crit: attrd_cs_destroy:
> Lost connection to Corosync service!
> Aug 27 12:33:59 [46858] node3 attrd: notice: main: Exiting...
> Aug 27 12:33:59 [46858] node3 attrd: notice: main:
> Disconnecting client 0x12579a0, pid=46860...
> Aug 27 12:33:59 [46858] node3 attrd: error:
> attrd_cib_connection_destroy: Connection to the CIB terminated...
> Aug 27 12:33:59 corosync [pcmk ] info: pcmk_ipc_exit: Client attrd
> (conn=0x1955f80, async-conn=0x1955f80) left
> Aug 27 12:33:59 [46856] node3 stonith-ng: error: crm_ipc_read:
> Connection to pacemakerd failed
> Aug 27 12:33:59 [46856] node3 stonith-ng: error:
> mainloop_gio_callback: Connection to pacemakerd[0x2314af0] closed
> (I/O condition=17)
> Aug 27 12:33:59 [46856] node3 stonith-ng: error:
> stonith_peer_cs_destroy: Corosync connection terminated
> Aug 27 12:33:59 [46856] node3 stonith-ng: info: stonith_shutdown:
> Terminating with 1 clients
> Aug 27 12:33:59 [46856] node3 stonith-ng: info:
> cib_connection_destroy: Connection to the CIB closed.
> ...
>
> Please see corosynclog.txt for the detailed log.
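What the excerpt shows, as far as I can read it: corosync first fails to
deliver pacemakerd's CPG messages ("crm_cs_flush: ... Try again (6)" is
corosync's CS_ERR_TRY_AGAIN, error code 6), then briefly reports the local
node as having left the 'pacemakerd' CPG group. pacemakerd treats that as
an eviction ("We're not part of CPG group 'pacemakerd' anymore!") and
shuts down, and the child daemons (attrd, stonith-ng, ...) exit with it.
To spot the same sequence in other runs, something like this should do:

    grep -E 'crm_cs_flush|pcmk_cpg_membership|Evicted' /var/log/cluster/corosync.log
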
>
>
> [root@node3 ~]# cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     secauth: off
>     interface {
>         member {
>             memberaddr: 192.168.125.134
>         }
>         member {
>             memberaddr: 192.168.125.129
>         }
>         member {
>             memberaddr: 192.168.125.135
>         }
>
>         ringnumber: 0
>         bindnetaddr: 192.168.125.135
>         mcastport: 5405
>         ttl: 1
>     }
>     transport: udpu
> }
>
> logging {
>     fileline: off
>     to_logfile: yes
>     to_syslog: no
>     logfile: /var/log/cluster/corosync.log
>     debug: off
>     timestamp: on
>     logger_subsys {
>         subsys: AMF
>         debug: off
>     }
> }
>
> service {
>     ver: 1
>     name: pacemaker
> }
>
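Whether anything survives 90% packet loss is doubtful, but on merely lossy
links it can help to give the totem protocol more headroom before it gives
up on the token. A sketch only, with made-up values you would have to tune
for your network (token is in milliseconds; consensus has to stay larger
than token):

    totem {
        ...
        token: 10000
        token_retransmits_before_loss_const: 20
        consensus: 12000
    }
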
> Environment:
> [root@node3 ~]# rpm -q corosync
> corosync-1.4.1-7.el6.x86_64
That is quite old ...
> [root@node3 ~]# cat /etc/redhat-release
> CentOS release 6.3 (Final)
> [root@node3 ~]# pacemakerd -F
> Pacemaker 1.1.14-1.el6 (Build: 70404b0)
and I doubt that many people have tested Pacemaker 1.1.14 against
corosync 1.4.1 ... the two are quite far apart release-wise (more on
that below) ...
> Supporting v3.0.10: generated-manpages agent-manpages ascii-docs
> ncurses libqb-logging libqb-ipc nagios corosync-plugin cman acls
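So before digging deeper I'd first try to reproduce this with a
combination that was actually built and tested together, e.g. the
packages your distribution ships for its current release. Checking what
you have against what the repos offer is quick:

    rpm -q corosync pacemaker
    yum list corosync pacemaker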