[ClusterLabs] continuous QUORUM messages in a 3-node cluster (Jan Friesse)

Ilia Sokolinski ilia at clearskydata.com
Tue Oct 13 11:46:34 EDT 2015


> ------------------------------
> 
> Message: 2
> Date: Mon, 12 Oct 2015 10:06:01 +0200
> From: Jan Friesse <jfriesse at redhat.com>
> To: Cluster Labs - All topics related to open-source clustering
> 	welcomed	<users at clusterlabs.org>
> Subject: Re: [ClusterLabs] continuous QUORUM messages in a 3-node
> 	cluster
> Message-ID: <561B69E9.1060202 at redhat.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> Ilia,
> 
> 
>> Hi,
>> 
>> We are using a 3-node pacemaker/corosync cluster on CentOs 7.
>> We have several identical setups in our QA/DEV orgs, and a couple of them continuously spew the following messages on all 3 nodes:
>> 
>> Oct  8 17:18:20 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553572) was formed. Members
>> Oct  8 17:18:20 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
>> Oct  8 17:18:20 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] Completed service synchronization, ready to provide service.
>> Oct  8 17:18:22 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553576) was formed. Members
>> Oct  8 17:18:22 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
>> Oct  8 17:18:22 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] Completed service synchronization, ready to provide service.
>> Oct  8 17:18:24 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553580) was formed. Members
>> Oct  8 17:18:24 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
>> Oct  8 17:18:24 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] Completed service synchronization, ready to provide service.
>> Oct  8 17:18:26 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] A new membership (10.1.13.134:1553584) was formed. Members
>> Oct  8 17:18:26 42-hw-rig4-L3-2 corosync[15105]: [QUORUM] Members[3]: 2 1 3
>> Oct  8 17:18:26 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] Completed service synchronization, ready to provide service.
>> 
>> The cluster seems to be generally happy:
>> 
>> [root@42-hw-rig4-L3-2 ~]# pcs cluster status
>> Cluster Status:
>>  Last updated: Thu Oct  8 17:24:02 2015
>>  Last change: Thu Oct  8 16:46:57 2015
>>  Stack: corosync
>>  Current DC: dq-ceph9.clearsky-data.net (3) - partition with quorum
>>  Version: 1.1.12-a14efad
>>  3 Nodes configured
>>  17 Resources configured
>> 
>> PCSD Status:
>>   42-hw-back-1.clearsky-data.net: Online
>>   41-hw-back-1.clearsky-data.net: Online
>>   dq-ceph9.clearsky-data.net: Online
>> 
>> The corosync config is:
>> 
>> totem {
>> version: 2
>> secauth: off
>> cluster_name: L3_cluster
>> transport: udpu
>> }
>> 
>> nodelist {
>>   node {
>>         ring0_addr: 42-hw-back-1.clearsky-data.net
>>         nodeid: 1
>>        }
>>   node {
>>         ring0_addr: 41-hw-back-1.clearsky-data.net
>>         nodeid: 2
>>        }
>>   node {
>>         ring0_addr: dq-ceph9.clearsky-data.net
>>         nodeid: 3
>>        }
>> }
>> 
>> quorum {
>> provider: corosync_votequorum
>> 
>> }
>> 
>> logging {
>> to_syslog: yes
>> debug : off
>> }
>> 
>> What do these messages mean, and how can we stop them?
>> 
>> Any help would be very appreciated.
>> 
>> Ilia Sokolinski
>> 
>> PS
>> 
>> I have tried to enable corosync debug and got the following logs:
>> 
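>> (Debug was enabled by switching the logging section shown above to roughly 
>> the following and then reloading the corosync configuration, which is the 
>> "Config reload requested by node 1" line at the start of the excerpt:)
>> 
>> logging {
>> to_syslog: yes
>> # switched from off while chasing the membership churn
>> debug: on
>> }
>> 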
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [CFG   ] Config reload requested by node 1
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] HUP conn (15105-46913-25)
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] qb_ipcs_disconnect(15105-46913-25) state:2
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] epoll_ctl(del): Bad file descriptor (9)
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] cs_ipcs_connection_closed()
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] cs_ipcs_connection_destroyed()
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-response-15105-46913-25-header
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-event-15105-46913-25-header
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cfg-request-15105-46913-25-header
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 3
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[3]: votes: 1, expected: 3 flags: 1
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 3
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 2
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[2]: votes: 1, expected: 3 flags: 1
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 2
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 1
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[1]: votes: 1, expected: 3 flags: 1
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] total_votes=3, expected_votes=3
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] node 1 state=1, votes=1, expected=3
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] node 2 state=1, votes=1, expected=3
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] node 3 state=1, votes=1, expected=3
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] lowest node id: 1 us: 1
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] highest node id: 3 us: 1
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] got nodeinfo message from cluster node 1
>> Oct  8 16:18:47 42-hw-rig4-L3-2 corosync[15105]: [VOTEQ ] nodeinfo message[0]: votes: 0, expected: 0 flags: 0
>> Oct  8 16:18:48 42-hw-rig4-L3-2 corosync[15105]: [TOTEM ] entering GATHER state from 9(merge during operational state).
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] IPC credentials authenticated (15105-46923-25)
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] connecting to client [46923]
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] shm size:1048589; real_size:1052672; rb->word_size:263168
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] connection created
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [CMAP  ] lib_init_fn: conn=0x560430414560
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] HUP conn (15105-46923-25)
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] qb_ipcs_disconnect(15105-46923-25) state:2
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] epoll_ctl(del): Bad file descriptor (9)
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] cs_ipcs_connection_closed()
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [CMAP  ] exit_fn for conn=0x560430414560
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [MAIN  ] cs_ipcs_connection_destroyed()
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cmap-response-15105-46923-25-header
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cmap-event-15105-46923-25-header
>> Oct  8 16:18:49 42-hw-rig4-L3-2 corosync[15105]: [QB    ] Free'ing ringbuffer: /dev/shm/qb-cmap-request-15105-46923-25-header
> 
> Line "entering GATHER state from 9(merge during operational state)". is 
> interesting. I believe you have different cluster on the network where 
> you forgot ip addr of one of node in affected cluster. As a simple test, 
> you can try to add
> 
>         interface {
>             ringnumber: 0
>             mcastport: 5409
>         }
> 
> into the corosync.conf totem section on all nodes of one of the affected 
> clusters and see if the problem disappears.
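> 
> For reference, merged into the totem section quoted above, that would look 
> roughly like this (5409 is just an example value; it only needs to differ 
> from the port the other cluster is using, the corosync default being 5405):
> 
> totem {
>     version: 2
>     secauth: off
>     cluster_name: L3_cluster
>     transport: udpu
>     interface {
>         ringnumber: 0
>         # any port not already used by the other cluster on this network
>         mcastport: 5409
>     }
> }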
> 
> Honza

Thank you very much!
I found a couple of old nodes which were still configured to be members of this cluster, and destroyed the pcs config on them (pcs cluster destroy --force).
The log messages stopped immediately.
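
In case it helps anyone who finds this thread later: those old nodes apparently 
still carried a corosync.conf for this cluster, so they kept triggering the 
membership re-formation that produced the continuous "A new membership ... was 
formed" lines. Their stale config presumably still looked much like the one 
quoted above, roughly:

totem {
version: 2
cluster_name: L3_cluster
transport: udpu
}

nodelist {
  node {
        ring0_addr: 42-hw-back-1.clearsky-data.net
        nodeid: 1
       }
  # ... plus the other two nodes listed above
}

Destroying the pcs config on those hosts removed exactly that.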

Ilia


