[ClusterLabs] Corosync node gets unique Ring ID
Igor Tverdovskiy
igor.tverdovskiy.pe at gmail.com
Mon Jan 25 21:45:27 EST 2021
Hi All,
> pacemakerd --version
Pacemaker 1.1.15-11.el7
> corosync -v
Corosync Cluster Engine, version '2.4.0'
> rpm -qi libqb
Name : libqb
Version : 1.0.1
Please assist. I recently ran into what I suppose is a strange bug: one of the
cluster nodes ends up with a "Ring ID" that differs from the other nodes',
for example after a corosync config reload:
*Affected node:*
============
(target.standby)> sudo corosync-quorumtool
Quorum information
------------------
Date: Tue Jan 26 01:58:54 2021
Quorum provider: corosync_votequorum
Nodes: 5
Node ID: 5
Ring ID: *7/59268* <<<<<<<
Quorate: Yes
Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 5
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
7 1 dispatching-sbc
8 1 dispatching-sbc-2-6
3 1 10.27.77.202
5 1 cassandra-3 (local)
6 1 10.27.77.205
============
*OK nodes:*
> sudo corosync-quorumtool
Quorum information
------------------
Date: Tue Jan 26 01:59:13 2021
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 8
Ring ID: *7/59300* <<<<<<<
Quorate: Yes
Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 4
Quorum: 3
Flags: Quorate
Membership information
----------------------
Nodeid Votes Name
7 1 10.27.77.106
8 1 10.27.77.107 (local)
3 1 10.27.77.202
6 1 10.27.77.205
============
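For reference, this is roughly how I compare the Ring ID across all nodes (a
sketch: the host list matches our nodelist, and passwordless ssh with sudo is
assumed):
============
#!/bin/sh
# Print the Ring ID as each node sees it; on a healthy cluster
# every line should report the same value.
for host in 10.27.77.202 10.27.77.205 10.27.77.106 10.27.77.107 10.27.77.204; do
    printf '%s: ' "$host"
    ssh "$host" 'sudo corosync-quorumtool | grep "Ring ID"'
done
============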
Also strange: on the affected node *crm status shows only two of five nodes*,
while at the same time
*"sudo crm_node -l" shows all 5 nodes as members*.
============
(target.standby)> sudo crm_node -l
5 target.standby member
7 target.dsbc1 member
3 target.sip member
8 target.dsbc member
6 target.sec.sip member
-------
(target.standby)> sudo crm status
Stack: corosync
Current DC: target.sip (version 1.1.15-11.el7-e174ec8) - partition with
quorum
Last updated: Tue Jan 26 02:08:02 2021 Last change: Mon Jan 25
14:27:18 2021 by root via crm_node on target.sec.sip
2 nodes and 7 resources configured
Online: [ target.sec.sip target.sip ] <<<<<<
Full list of resources:
============
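To put corosync's own membership view next to pacemaker's, the runtime members
can be dumped from cmap (a sketch; the runtime.totem.pg.mrp.srp.members key
prefix is what corosync 2.x publishes, assumed here):
============
(target.standby)> sudo corosync-cmapctl | grep members
# expected output, one group of keys per node, e.g.:
# runtime.totem.pg.mrp.srp.members.5.ip (str) = r(0) ip(10.27.77.204)
# runtime.totem.pg.mrp.srp.members.5.status (str) = joined
============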
The practical impact is that crm configure operations fail with a timeout error:
============
(target.standby)> sudo crm configure property maintenance-mode=true
*Call cib_apply_diff failed (-62): Timer expired*
ERROR: could not patch cib (rc=62)
INFO: offending xml diff: <diff format="2">
<change operation="modify"
path="/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']/nvpair[@id='cib-bootstrap-options-maintenance-mode']">
<change-list>
<change-attr name="value" operation="set" value="true"/>
</change-list>
<change-result>
<nvpair name="maintenance-mode" value="true"
id="cib-bootstrap-options-maintenance-mode"/>
</change-result>
</change>
</diff>
============
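Before retrying the change I check whether the local cib daemon answers at all
(a sketch; cibadmin ships with pacemaker, and timeout is from coreutils):
============
(target.standby)> timeout 10 sudo cibadmin --query > /dev/null \
    && echo "CIB responds" || echo "CIB query timed out or failed"
============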
The log contains errors saying that Totem is unable to form a cluster:
============
(target.standby)
*The first entry seems to be caused by a node-initiated config reload (corosync-cfgtool -R):*
Jan 26 01:40:35 [13190] destination-standby corosync notice [CFG ]
Config reload requested by node 7
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
removing dynamic member 10.27.77.106 for ring 0
Jan 26 01:40:35 [13190] destination-standby corosync notice [TOTEM ]
removing UDPU member {10.27.77.106}
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
Closing socket to: {10.27.77.106}
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
removing dynamic member 10.27.77.107 for ring 0
Jan 26 01:40:35 [13190] destination-standby corosync notice [TOTEM ]
removing UDPU member {10.27.77.107}
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
Closing socket to: {10.27.77.107}
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
removing dynamic member 10.27.77.204 for ring 0
Jan 26 01:40:35 [13190] destination-standby corosync notice [TOTEM ]
removing UDPU member {10.27.77.204}
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
Closing socket to: {10.27.77.204}
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
Configuration reloaded. Dumping actual totem config.
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ] Token
Timeout (5000 ms) retransmit timeout (1190 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ] token
hold (942 ms) retransmits before loss (4 retrans)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ] join
(50 ms) send_join (0 ms) consensus (6000 ms) merge (200 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
downcheck (1000 ms) fail to recv const (2500 msgs)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ] seqno
unchanged const (30 rotations) Maximum network MTU 1369
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
window size per rotation (50 messages) maximum messages per rotation (17
messages)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
missed count const (5 messages)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ] RRP
token expired timeout (1190 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ] RRP
token problem counter (2000 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ] RRP
threshold (10 problem count)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ] RRP
multicast threshold (100 problem count)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ] RRP
automatic recovery check timeout (1000 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ] RRP
mode set to none.
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
heartbeat_failures_allowed (0)
Jan 26 01:40:35 [13190] destination-standby corosync debug [TOTEM ]
max_network_delay (50 ms)
Jan 26 01:40:35 [13190] destination-standby corosync debug [VOTEQ ]
Reading configuration (runtime: 1)
Jan 26 01:40:35 [13190] destination-standby corosync debug [VOTEQ ] No
nodelist defined or our node is not in the nodelist
Jan 26 01:40:35 [13190] destination-standby corosync debug [VOTEQ ]
ev_tracking=0, ev_tracking_barrier = 0: expected_votes = 2
Jan 26 01:40:35 [13190] destination-standby corosync debug [VOTEQ ]
flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No
QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ] got
nodeinfo message from cluster node 8
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ]
nodeinfo message[8]: votes: 1, expected: 5 flags: 1
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ]
flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No
QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ]
total_votes=5, expected_votes=2
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ]
Sending expected votes callback
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ] node
3 state=1, votes=1, expected=5
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ] node
5 state=1, votes=1, expected=5
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ] node
6 state=1, votes=1, expected=5
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ] node
7 state=1, votes=1, expected=5
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ] node
8 state=1, votes=1, expected=5
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ]
lowest node id: 3 us: 5
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ]
highest node id: 8 us: 5
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ] got
nodeinfo message from cluster node 8
Jan 26 01:40:35 [13191] destination-standby corosync debug [VOTEQ ]
nodeinfo message[0]: votes: 0, expected: 0 flags: 0
Jan 26 01:40:38 [13191] destination-standby corosync error [TOTEM ]
FAILED TO RECEIVE
Jan 26 01:40:38 [13191] destination-standby corosync debug [TOTEM ]
entering GATHER state from 6(failed to receive).
Jan 26 01:40:44 [13191] destination-standby corosync debug [TOTEM ]
entering GATHER state from 0(consensus timeout).
Jan 26 01:40:48 [13240] destination-standby crmd: debug:
throttle_cib_load: cib load: 0.000333 (1 ticks in 30s)
Jan 26 01:40:48 [13240] destination-standby crmd: debug:
throttle_load_avg: Current load is 1.010000 (full: 1.01 0.52 0.45 2/471
10513)
Jan 26 01:40:48 [13240] destination-standby crmd: debug:
throttle_io_load: Current IO load is 0.000000
Jan 26 01:40:55 [13191] destination-standby corosync debug [TOTEM ] The
consensus timeout expired.
Jan 26 01:40:55 [13191] destination-standby corosync debug [TOTEM ]
entering GATHER state from 3(The consensus timeout expired.).
Jan 26 01:41:06 [13191] destination-standby corosync debug [TOTEM ] The
consensus timeout expired.
Jan 26 01:41:06 [13191] destination-standby corosync debug [TOTEM ]
entering GATHER state from 3(The consensus timeout expired.).
Jan 26 01:41:17 [13191] destination-standby corosync debug [TOTEM ] The
consensus timeout expired.
Jan 26 01:41:17 [13191] destination-standby corosync debug [TOTEM ]
entering GATHER state from 3(The consensus timeout expired.).
Jan 26 01:41:18 [13191] destination-standby corosync debug [QB ] IPC
credentials authenticated (13191-14633-16)
Jan 26 01:41:18 [13191] destination-standby corosync debug [QB ]
connecting to client [14633]
Jan 26 01:41:18 [13191] destination-standby corosync debug [QB ] shm
size:1048589; real_size:1052672; rb->word_size:263168
Jan 26 01:41:18 [13191] destination-standby corosync debug [QB ] shm
size:1048589; real_size:1052672; rb->word_size:263168
Jan 26 01:41:18 [13191] destination-standby corosync debug [QB ] shm
size:1048589; real_size:1052672; rb->word_size:263168
Jan 26 01:41:18 [13191] destination-standby corosync debug [MAIN ]
connection created
Jan 26 01:41:18 [13240] destination-standby crmd: debug:
throttle_cib_load: cib load: 0.000000 (0 ticks in 30s)
Jan 26 01:41:18 [13240] destination-standby crmd: debug:
throttle_load_avg: Current load is 1.360000 (full: 1.36 0.64 0.49 1/475
14633)
Jan 26 01:41:18 [13240] destination-standby crmd: debug:
throttle_io_load: Current IO load is 0.000000
Jan 26 01:41:28 [13191] destination-standby corosync debug [TOTEM ] The
consensus timeout expired.
Jan 26 01:41:28 [13191] destination-standby corosync debug [TOTEM ]
entering GATHER state from 3(The consensus timeout expired.).
Jan 26 01:41:28 [13191] destination-standby corosync warning [MAIN ] Totem
is unable to form a cluster because of an operating system or network
fault. The most common cause of this message is that the local firewall is
configured improperly.
Jan 26 01:41:30 [13191] destination-standby corosync warning [MAIN ] Totem
is unable to form a cluster because of an operating system or network
fault. The most common cause of this message is that the local firewall is
configured improperly.
Jan 26 02:10:42 [13191] destination-standby corosync warning [MAIN ] *Totem
is unable to form a cluster because of an operating system or network
fault. The most common cause of this message is that the local firewall is
configured improperly.*
Jan 26 02:10:42 [13235] destination-standby cib: debug:
crm_client_new: Connecting 0x560a4bef1f60 for uid=0 gid=0 pid=6916
id=5a414534-bc62-4544-a5d2-6deb772a6b49
Jan 26 02:10:42 [13235] destination-standby cib: debug:
handle_new_connection: IPC credentials authenticated (13235-6916-13)
Jan 26 02:10:42 [13235] destination-standby cib: debug:
qb_ipcs_shm_connect: connecting to client [6916]
Jan 26 02:10:42 [13235] destination-standby cib: debug:
qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:42 [13235] destination-standby cib: debug:
qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:42 [13235] destination-standby cib: debug:
qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:42 [13235] destination-standby cib: debug:
cib_acl_enabled: CIB ACL is disabled
Jan 26 02:10:42 [13235] destination-standby cib: debug:
qb_ipcs_dispatch_connection_request: HUP conn (13235-6916-13)
Jan 26 02:10:42 [13235] destination-standby cib: debug:
qb_ipcs_disconnect: qb_ipcs_disconnect(13235-6916-13) state:2
Jan 26 02:10:42 [13235] destination-standby cib: debug:
crm_client_destroy: Destroying 0 events
Jan 26 02:10:42 [13235] destination-standby cib: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-cib_rw-response-13235-6916-13-header
Jan 26 02:10:42 [13235] destination-standby cib: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-cib_rw-event-13235-6916-13-header
Jan 26 02:10:42 [13235] destination-standby cib: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-cib_rw-request-13235-6916-13-header
Jan 26 02:10:42 [13240] destination-standby crmd: debug:
crm_client_new: Connecting 0x55ce3bb0e6f0 for uid=0 gid=0 pid=6919
id=3a72b777-daa0-4b0e-acc2-fc58a07f31a6
Jan 26 02:10:42 [13240] destination-standby crmd: debug:
handle_new_connection: IPC credentials authenticated (13240-6919-13)
Jan 26 02:10:42 [13240] destination-standby crmd: debug:
qb_ipcs_shm_connect: connecting to client [6919]
Jan 26 02:10:42 [13240] destination-standby crmd: debug:
qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:42 [13240] destination-standby crmd: debug:
qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:42 [13240] destination-standby crmd: debug:
qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:44 [13191] destination-standby corosync warning [MAIN ] Totem
is unable to form a cluster because of an operating system or network
fault. The most common cause of this message is that the local firewall is
configured improperly.
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_ipcs_dispatch_connection_request: HUP conn (13240-6125-14)
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_ipcs_disconnect: qb_ipcs_disconnect(13240-6125-14) state:2
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
crm_client_destroy: Destroying 0 events
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-crmd-response-13240-6125-14-header
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-crmd-event-13240-6125-14-header
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-crmd-request-13240-6125-14-header
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
crm_client_new: Connecting 0x55ce3bb22960 for uid=0 gid=0 pid=6928
id=039044f3-e674-4afa-9857-47459d1f0d0a
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
handle_new_connection: IPC credentials authenticated (13240-6928-14)
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_ipcs_shm_connect: connecting to client [6928]
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
Jan 26 02:10:44 [13235] destination-standby cib: debug:
crm_client_new: Connecting 0x560a4bef1f60 for uid=0 gid=0 pid=6928
id=45726483-e7b8-4ed5-8388-c6e8578d3366
Jan 26 02:10:44 [13235] destination-standby cib: debug:
handle_new_connection: IPC credentials authenticated (13235-6928-13)
Jan 26 02:10:44 [13235] destination-standby cib: debug:
qb_ipcs_shm_connect: connecting to client [6928]
Jan 26 02:10:44 [13235] destination-standby cib: debug:
qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:44 [13235] destination-standby cib: debug:
qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:44 [13235] destination-standby cib: debug:
qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
Jan 26 02:10:44 [13235] destination-standby cib: debug:
cib_acl_enabled: CIB ACL is disabled
Jan 26 02:10:44 [13235] destination-standby cib: debug:
qb_ipcs_dispatch_connection_request: HUP conn (13235-6928-13)
Jan 26 02:10:44 [13235] destination-standby cib: debug:
qb_ipcs_disconnect: qb_ipcs_disconnect(13235-6928-13) state:2
Jan 26 02:10:44 [13235] destination-standby cib: debug:
crm_client_destroy: Destroying 0 events
Jan 26 02:10:44 [13235] destination-standby cib: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-cib_rw-response-13235-6928-13-header
Jan 26 02:10:44 [13235] destination-standby cib: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-cib_rw-event-13235-6928-13-header
Jan 26 02:10:44 [13235] destination-standby cib: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-cib_rw-request-13235-6928-13-header
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_ipcs_dispatch_connection_request: HUP conn (13240-6928-14)
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_ipcs_disconnect: qb_ipcs_disconnect(13240-6928-14) state:2
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
crm_client_destroy: Destroying 0 events
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-crmd-response-13240-6928-14-header
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-crmd-event-13240-6928-14-header
Jan 26 02:10:44 [13240] destination-standby crmd: debug:
qb_rb_close_helper: Free'ing ringbuffer:
/dev/shm/qb-crmd-request-13240-6928-14-header
Jan 26 02:10:45 [13191] destination-standby corosync warning [MAIN ] Totem
is unable to form a cluster because of an operating system or network
fault. The most common cause of this message is that the local firewall is
configured improperly.
Jan 26 02:10:47 [13191] destination-standby corosync warning [MAIN ] Totem
is unable to form a cluster because of an operating system or network
fault. The most common cause of this message is that the local firewall is
configured improperly.
Jan 26 02:10:48 [13191] destination-standby corosync debug [TOTEM ] The
consensus timeout expired.
Jan 26 02:10:48 [13191] destination-standby corosync debug [TOTEM ]
entering GATHER state from 3(The consensus timeout expired.).
Jan 26 02:10:48 [13191] destination-standby corosync warning [MAIN ] Totem
is unable to form a cluster because of an operating system or network
fault. The most common cause of this message is that the local firewall is
configured improperly.
============
(target.standby)> cat /etc/corosync/corosync.conf
totem {
    token: 5000
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        mcastport: 5405
        bindnetaddr: 10.27.77.204
    }
    transport: udpu
}
nodelist {
    node {
        ring0_addr: 10.27.77.202
        name: target.sip
        nodeid: 3
    }
    node {
        ring0_addr: 10.27.77.205
        name: target.sec.sip
        nodeid: 6
    }
    node {
        ring0_addr: 10.27.77.106
        name: target.dsbc1
        nodeid: 7
    }
    node {
        ring0_addr: 10.27.77.107
        name: target.dsbc
        nodeid: 8
    }
    node {
        ring0_addr: 10.27.77.204
        name: target.standby
        nodeid: 5
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: no
    logfile: /var/10.27.77.204/log/cluster-suite/corosync.log
    syslog_facility: local0
    debug: on
    timestamp: on
    logger_subsys {
        subsys: QUORUM
        debug: on
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 5
}
===================
(target.standby)> ping 10.27.77.106    # target.dsbc1
PING 10.27.77.106 (10.27.77.106) 56(84) bytes of data.
64 bytes from 10.27.77.106: icmp_seq=1 ttl=64 time=0.177 ms
64 bytes from 10.27.77.106: icmp_seq=2 ttl=64 time=0.173 ms
===================
> sudo nmap -sU -p 5405 10.27.77.106
Starting Nmap 6.40 ( http://nmap.org ) at 2021-01-26 02:24 UTC
Nmap scan report for dispatching-sbc (10.27.77.106)
Host is up (0.00029s latency).
PORT STATE SERVICE
5405/udp open|filtered unknown
=========
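Since nmap can only report open|filtered for UDP, I also capture the actual
totem traffic on the affected node to confirm that datagrams from the peers
arrive (standard tcpdump flags):
============
(target.standby)> sudo tcpdump -n -i any -c 10 udp port 5405
============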
My observation is that the issue can be fixed by stopping all nodes and then
starting them one by one; restarting only the affected node may help as well.
Nodes see each other only when the Ring ID is equal on all of them.
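For completeness, the full workaround looks roughly like this (a sketch; EL7
systemd unit names assumed):
============
# On every node:
sudo systemctl stop pacemaker
sudo systemctl stop corosync
# Then start the nodes one by one, verifying each join:
sudo systemctl start corosync
sudo systemctl start pacemaker
sudo corosync-quorumtool    # Ring ID must match the other nodes before continuing
============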
Do you have any ideas?
Could you explain what the Ring ID means in terms of corosync quorum (RRP is
not configured)?
Why does it differ between nodes?
What additional information should I provide if it reproduces again?
The issue reproduces sporadically. Sometimes it is combined with 100% CPU
usage by corosync, which also requires restarting pacemaker/corosync on the
affected node, and sometimes on all nodes. In this particular case CPU usage
is normal.