[ClusterLabs] continuous QUORUM messages in a 3-node cluster

Jan Friesse jfriesse at redhat.com
Mon Oct 12 04:06:01 EDT 2015


Ilia,


> Hi,
>
> We are using a 3-node pacemaker/corosync cluster on CentOS 7.
> We have several identical setups in our QA/DEV orgs, and a couple of them continuously spew the following messages on all 3 nodes:
>
> Oct  8 17:18:20 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553572) was formed. Members
> Oct  8 17:18:20 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
> Oct  8 17:18:20 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] Completed service synchronization, ready to provide service.
> Oct  8 17:18:22 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553576) was formed. Members
> Oct  8 17:18:22 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
> Oct  8 17:18:22 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] Completed service synchronization, ready to provide service.
> Oct  8 17:18:24 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553580) was formed. Members
> Oct  8 17:18:24 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
> Oct  8 17:18:24 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] Completed service synchronization, ready to provide service.
> Oct  8 17:18:26 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553584) was formed. Members
> Oct  8 17:18:26 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
> Oct  8 17:18:26 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] Completed service synchronization, ready to provide service.
>
> The cluster seems to be generally happy:
>
> [root@42-hw-rig4-L3-2 ~]# pcs cluster status
> Cluster Status:
>   Last updated: Thu Oct  8 17:24:02 2015
>   Last change: Thu Oct  8 16:46:57 2015
>   Stack: corosync
>   Current DC: dq-ceph9.clearsky-data.net (3) - partition with quorum
>   Version: 1.1.12-a14efad
>   3 Nodes configured
>   17 Resources configured
>
> PCSD Status:
>    42-hw-back-1.clearsky-data.net: Online
>    41-hw-back-1.clearsky-data.net: Online
>    dq-ceph9.clearsky-data.net: Online
>
> The corosync config is:
>
> totem {
> version: 2
> secauth: off
> cluster_name: L3_cluster
> transport: udpu
> }
>
> nodelist {
>    node {
>          ring0_addr: 42-hw-back-1.clearsky-data.net
>          nodeid: 1
>         }
>    node {
>          ring0_addr: 41-hw-back-1.clearsky-data.net
>          nodeid: 2
>         }
>    node {
>          ring0_addr: dq-ceph9.clearsky-data.net
>          nodeid: 3
>         }
> }
>
> quorum {
> provider: corosync_votequorum
>
> }
>
> logging {
> to_syslog: yes
> debug : off
> }
>
> What do these messages mean, and how can we stop them?
>
> Any help would be greatly appreciated.
>
> Ilia Sokolinski
>
> PS
>
> I have tried to enable corosync debug and got the following logs:
>
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [CFG   ] Config reload requested by node 1
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] HUP conn (15105-46913-25)
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] qb_ipcs_disconnect(15105-46913-25) state:2
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] epoll_ctl(del): Bad file descriptor (9)
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] cs_ipcs_connection_closed()
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] cs_ipcs_connection_destroyed()
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-15105-46913-25-header
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-event-15105-46913-25-header
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-15105-46913-25-header
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 3
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 3 flags: 1
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 3
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 2
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 3 flags: 1
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 2
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 1
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 3 flags: 1
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] total_votes=3, expected_votes=3
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] node 1 state=1, votes=1, expected=3
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] node 2 state=1, votes=1, expected=3
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] node 3 state=1, votes=1, expected=3
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] lowest node id: 1 us: 1
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] highest node id: 3 us: 1
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 1
> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> Oct  8 16:18:48 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] entering GATHER state from 9(merge during operational state).
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] IPC credentials authenticated (15105-46923-25)
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] connecting to client [46923]
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] connection created
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [CMAP  ] lib_init_fn: conn=0x560430414560
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] HUP conn (15105-46923-25)
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] qb_ipcs_disconnect(15105-46923-25) state:2
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] epoll_ctl(del): Bad file descriptor (9)
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] cs_ipcs_connection_closed()
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [CMAP  ] exit_fn for conn=0x560430414560
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] cs_ipcs_connection_destroyed()
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-15105-46923-25-header
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-15105-46923-25-header
> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-15105-46923-25-header

Line "entering GATHER state from 9(merge during operational state)". is 
interesting. I believe you have different cluster on the network where 
you forgot ip addr of one of node in affected cluster. As a simple test, 
you can try to add

         interface {
             ringnumber: 0
             mcastport: 5409
         }

into corosync.conf totem section on all nodes in one of affected cluster 
and see if problem disappear.
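
For reference, the resulting totem section would look roughly like this (a 
sketch based on the config you posted; 5409 is just an example value, any UDP 
port the other cluster is not using should work):

         totem {
             version: 2
             secauth: off
             cluster_name: L3_cluster
             transport: udpu

             interface {
                 ringnumber: 0
                 mcastport: 5409
             }
         }

As far as I know, mcastport also determines the UDP port used by the udpu 
transport, so moving one cluster to a different port should keep the two 
clusters from seeing each other's traffic. You can check which port corosync 
is actually bound to on each node with something like "ss -ulpn | grep corosync".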

Honza

> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>




