[ClusterLabs] Corosync node gets unique Ring ID

Igor Tverdovskiy igor.tverdovskiy.pe at gmail.com
Wed Jan 27 04:53:42 EST 2021


Hi Christine,

Thanks for your input!

> It worries me that corosync-quorumtool behaves differently on some nodes
> - some show names, some just IP addresses. That could be a cause of some
> inconsistency.

I hadn't noticed this. We use static node names (not hostnames), which
makes it even stranger.

> So what seems to be happening is that the cluster is being partitioned
> somehow (I can't tell why, that's something you'll need to investigate)
> and corosync isn't recovering very well from it. One of the things that
> can make this happen is doing "ifdown" - which that old version of
> corosync doesn't cope with very well. Even if that's not exactly what
> you are doing (and I see no reason to believe you are) I do wonder if
> something similar is happening by other means - NetworkManager perhaps?)

These are Proxmox VMs on a single physical host; no changes to filtering
were made for sure, so genuine network issues are largely ruled out.
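
To double-check, next time it reproduces I can capture the totem traffic
and the firewall state on the affected node, roughly like this (just a
sketch; "eth0" is a placeholder for the actual cluster interface):

    sudo tcpdump -n -i eth0 udp port 5405
    sudo iptables -L -n -v | grep 5405
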
As for "ifdown" it is also not touched by us, but yes, NetworkManager
is configured there.
I have disabled reloading (since it is not necessary for current
reconfiguration approach)
will look after it.
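
This is only a sketch (again, "eth0" stands in for the real interface
name), but a drop-in config like this should keep NetworkManager from
ever taking the ring interface down:

    # /etc/NetworkManager/conf.d/99-cluster-unmanaged.conf
    [keyfile]
    unmanaged-devices=interface-name:eth0

    sudo systemctl reload NetworkManager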

> Oh and also, try to upgrade to corosync 2.4.5 at least. I'm sure that
> will help.

We froze updates at some point because a newer Pacemaker release changed
the "resource-stickiness=-1" logic:
> resource-stickiness processing changed in Pacemaker 1.1.16: with resource-stickiness=-1, cloned resources move from one node to another constantly.
> Fair to mention that resource-stickiness=0 does not work as expected even in earlier versions, so -1 was used as a workaround.

However, at the moment we do not use cloned resources, so we will
consider updating. Thanks.
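
For the record, once we schedule it, the update itself should be roughly
this on CentOS 7, one node at a time (assuming the base repos already
carry corosync >= 2.4.5):

    sudo systemctl stop pacemaker corosync
    sudo yum update corosync corosynclib pacemaker
    sudo systemctl start corosync pacemaker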


On Wed, Jan 27, 2021 at 10:18 AM Christine Caulfield
<ccaulfie at redhat.com> wrote:
>
> A few things really stand out from this report, I think the inconsistent
> ring_id is just a symptom.
>
> It worries me that corosync-quorumtool behaves differently on some nodes
> - some show names, some just IP addresses. That could be a cause of some
> inconsistency.
>
> Also the messages
> "
> Jan 26 02:10:45 [13191] destination-standby corosync warning [MAIN  ]
> Totem is unable to form a cluster because of an operating system or
> network fault. The most common cause of this message is that the local
> firewall is configured improperly.
> Jan 26 02:10:47 [13191] destination-standby corosync warning [MAIN  ]
> Totem is unable to form a cluster because of an operating system or
> network fault. The most common cause of this message is that the local
> firewall is configured improperly.
> Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ]
> The consensus timeout expired.
> Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ]
> entering GATHER state from 3(The consensus timeout expired.).
> Jan 26 02:10:48 [13191] destination-standby corosync warning [MAIN  ]
> Totem is unable to form a cluster because of an operating system or
> network fault. The most common cause of this message is that the local
> firewall is configured improperly."
>
> Are a BAD sign. All this is contributing to the problems and also the
> timeout on reload (which is really not a good thing). Those messages
> are not caused by the reload, they are caused by some networking problems.
>
> So what seems to be happening is that the cluster is being partitioned
> somehow (I can't tell why, that's something you'll need to investigate)
> and corosync isn't recovering very well from it. One of the things that
> can make this happen is doing "ifdown" - which that old version of
> corosync doesn't cope with very well. Even if that's not exactly what
> you are doing (and I see no reason to believe you are) I do wonder if
> something similar is happening by other means - NetworkManager perhaps?)
>
> So firstly, check the networking setup and be sure that all the nodes are
> consistently configured, and check that the network is not closing down
> interfaces or ports at the time of the incident.
>
> Oh and also, try to upgrade to corosync 2.4.5 at least. I'm sure that
> will help.
>
> Chrissie
>
>
>
> On 26/01/2021 02:45, Igor Tverdovskiy wrote:
> > Hi All,
> >
> >  > pacemakerd -$
> > Pacemaker 1.1.15-11.el7
> >
> >  > corosync -v
> > Corosync Cluster Engine, version '2.4.0'
> >
> >  > rpm -qi libqb
> > Name        : libqb
> > Version     : 1.0.1
> >
> > Please assist. We recently hit what I suppose is a strange bug: one of the
> > cluster nodes ends up with a "Ring ID" different from the other nodes, for
> > example after a corosync config reload, e.g.:
> >
> >
> > *Affected node:*
> > ============
> > (target.standby)> sudo corosync-quorumtool
> > Quorum information
> > ------------------
> > Date:             Tue Jan 26 01:58:54 2021
> > Quorum provider:  corosync_votequorum
> > Nodes:            5
> > Node ID:          5
> > Ring ID: *7/59268* <<<<<<<
> > Quorate:          Yes
> >
> > Votequorum information
> > ----------------------
> > Expected votes:   5
> > Highest expected: 5
> > Total votes:      5
> > Quorum:           3
> > Flags:            Quorate
> >
> > Membership information
> > ----------------------
> >      Nodeid      Votes Name
> >           7          1 dispatching-sbc
> >           8          1 dispatching-sbc-2-6
> >           3          1 10.27.77.202
> >           5          1 cassandra-3 (local)
> >           6          1 10.27.77.205
> >
> > ============
> >
> > *OK nodes:*
> >  > sudo corosync-quorumtool
> > Quorum information
> > ------------------
> > Date:             Tue Jan 26 01:59:13 2021
> > Quorum provider:  corosync_votequorum
> > Nodes:            4
> > Node ID:          8
> > Ring ID: *7/59300* <<<<<<<
> > Quorate:          Yes
> >
> > Votequorum information
> > ----------------------
> > Expected votes:   5
> > Highest expected: 5
> > Total votes:      4
> > Quorum:           3
> > Flags:            Quorate
> >
> > Membership information
> > ----------------------
> >      Nodeid      Votes Name
> >           7          1 10.27.77.106
> >           8          1 10.27.77.107 (local)
> >           3          1 10.27.77.202
> >           6          1 10.27.77.205
> > ============
> >
> >
> > Also strange is that *crm status shows only two of five nodes* on the
> > affected node, but at the same time
> > *"sudo crm_node -l" shows all 5 nodes as members*.
> > ============
> > (target.standby)> sudo crm_node -l
> > 5 target.standby member
> > 7 target.dsbc1 member
> > 3 target.sip member
> > 8 target.dsbc member
> > 6 target.sec.sip member
> >
> > -------
> >
> > (target.standby)> sudo crm status
> > Stack: corosync
> > Current DC: target.sip (version 1.1.15-11.el7-e174ec8) - partition with
> > quorum
> > Last updated: Tue Jan 26 02:08:02 2021          Last change: Mon Jan 25
> > 14:27:18 2021 by root via crm_node on target.sec.sip
> >
> > 2 nodes and 7 resources configured
> >
> > Online: [ target.sec.sip target.sip ] <<<<<<
> >
> > Full list of resources:
> > ============
> >
> > The issue here is that crm configure operations fail with a timeout error:
> > ============
> > (target.standby)> sudo crm configure property maintenance-mode=true
> > *Call cib_apply_diff failed (-62): Timer expired*
> > ERROR: could not patch cib (rc=62)
> > INFO: offending xml diff: <diff format="2">
> >    <change operation="modify"
> > path="/cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']/nvpair[@id='cib-bootstrap-options-maintenance-mode']">
> >      <change-list>
> >        <change-attr name="value" operation="set" value="true"/>
> >      </change-list>
> >      <change-result>
> >        <nvpair name="maintenance-mode" value="true"
> > id="cib-bootstrap-options-maintenance-mode"/>
> >      </change-result>
> >    </change>
> > </diff>
> > ============
> >
> >
> > In the log there are errors that totem is unable to form a cluster:
> > ============
> > (target.standby)
> >
> > *First entry seems caused by node reloading (corosync-cfgtool -R):*
> >
> > Jan 26 01:40:35 [13190] destination-standby corosync notice  [CFG   ]
> > Config reload requested by node 7
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > removing dynamic member 10.27.77.106 for ring 0
> > Jan 26 01:40:35 [13190] destination-standby corosync notice  [TOTEM ]
> > removing UDPU member {10.27.77.106}
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > Closing socket to: {10.27.77.106}
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > removing dynamic member 10.27.77.107 for ring 0
> > Jan 26 01:40:35 [13190] destination-standby corosync notice  [TOTEM ]
> > removing UDPU member {10.27.77.107}
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > Closing socket to: {10.27.77.107}
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > removing dynamic member 10.27.77.204 for ring 0
> > Jan 26 01:40:35 [13190] destination-standby corosync notice  [TOTEM ]
> > removing UDPU member {10.27.77.204}
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > Closing socket to: {10.27.77.204}
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > Configuration reloaded. Dumping actual totem config.
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > Token Timeout (5000 ms) retransmit timeout (1190 ms)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > token hold (942 ms) retransmits before loss (4 retrans)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > join (50 ms) send_join (0 ms) consensus (6000 ms) merge (200 ms)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > downcheck (1000 ms) fail to recv const (2500 msgs)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > seqno unchanged const (30 rotations) Maximum network MTU 1369
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > window size per rotation (50 messages) maximum messages per rotation (17
> > messages)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > missed count const (5 messages)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > RRP token expired timeout (1190 ms)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > RRP token problem counter (2000 ms)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > RRP threshold (10 problem count)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > RRP multicast threshold (100 problem count)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > RRP automatic recovery check timeout (1000 ms)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > RRP mode set to none.
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > heartbeat_failures_allowed (0)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [TOTEM ]
> > max_network_delay (50 ms)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [VOTEQ ]
> > Reading configuration (runtime: 1)
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [VOTEQ ] No
> > nodelist defined or our node is not in the nodelist
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [VOTEQ ]
> > ev_tracking=0, ev_tracking_barrier = 0: expected_votes = 2
> > Jan 26 01:40:35 [13190] destination-standby corosync debug   [VOTEQ ]
> > flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No
> > QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > got nodeinfo message from cluster node 8
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > nodeinfo message[8]: votes: 1, expected: 5 flags: 1
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > flags: quorate: Yes Leaving: No WFA Status: No First: No Qdevice: No
> > QdeviceAlive: No QdeviceCastVote: No QdeviceMasterWins: No
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > total_votes=5, expected_votes=2
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > Sending expected votes callback
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > node 3 state=1, votes=1, expected=5
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > node 5 state=1, votes=1, expected=5
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > node 6 state=1, votes=1, expected=5
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > node 7 state=1, votes=1, expected=5
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > node 8 state=1, votes=1, expected=5
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > lowest node id: 3 us: 5
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > highest node id: 8 us: 5
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > got nodeinfo message from cluster node 8
> > Jan 26 01:40:35 [13191] destination-standby corosync debug   [VOTEQ ]
> > nodeinfo message[0]: votes: 0, expected: 0 flags: 0
> > Jan 26 01:40:38 [13191] destination-standby corosync error   [TOTEM ]
> > FAILED TO RECEIVE
> > Jan 26 01:40:38 [13191] destination-standby corosync debug   [TOTEM ]
> > entering GATHER state from 6(failed to receive).
> > Jan 26 01:40:44 [13191] destination-standby corosync debug   [TOTEM ]
> > entering GATHER state from 0(consensus timeout).
> > Jan 26 01:40:48 [13240] destination-standby       crmd:    debug:
> > throttle_cib_load:    cib load: 0.000333 (1 ticks in 30s)
> > Jan 26 01:40:48 [13240] destination-standby       crmd:    debug:
> > throttle_load_avg:    Current load is 1.010000 (full: 1.01 0.52 0.45
> > 2/471 10513)
> > Jan 26 01:40:48 [13240] destination-standby       crmd:    debug:
> > throttle_io_load:     Current IO load is 0.000000
> > Jan 26 01:40:55 [13191] destination-standby corosync debug   [TOTEM ]
> > The consensus timeout expired.
> > Jan 26 01:40:55 [13191] destination-standby corosync debug   [TOTEM ]
> > entering GATHER state from 3(The consensus timeout expired.).
> > Jan 26 01:41:06 [13191] destination-standby corosync debug   [TOTEM ]
> > The consensus timeout expired.
> > Jan 26 01:41:06 [13191] destination-standby corosync debug   [TOTEM ]
> > entering GATHER state from 3(The consensus timeout expired.).
> > Jan 26 01:41:17 [13191] destination-standby corosync debug   [TOTEM ]
> > The consensus timeout expired.
> > Jan 26 01:41:17 [13191] destination-standby corosync debug   [TOTEM ]
> > entering GATHER state from 3(The consensus timeout expired.).
> > Jan 26 01:41:18 [13191] destination-standby corosync debug   [QB    ]
> > IPC credentials authenticated (13191-14633-16)
> > Jan 26 01:41:18 [13191] destination-standby corosync debug   [QB    ]
> > connecting to client [14633]
> > Jan 26 01:41:18 [13191] destination-standby corosync debug   [QB    ]
> > shm size:1048589; real_size:1052672; rb->word_size:263168
> > Jan 26 01:41:18 [13191] destination-standby corosync debug   [QB    ]
> > shm size:1048589; real_size:1052672; rb->word_size:263168
> > Jan 26 01:41:18 [13191] destination-standby corosync debug   [QB    ]
> > shm size:1048589; real_size:1052672; rb->word_size:263168
> > Jan 26 01:41:18 [13191] destination-standby corosync debug   [MAIN  ]
> > connection created
> > Jan 26 01:41:18 [13240] destination-standby       crmd:    debug:
> > throttle_cib_load:    cib load: 0.000000 (0 ticks in 30s)
> > Jan 26 01:41:18 [13240] destination-standby       crmd:    debug:
> > throttle_load_avg:    Current load is 1.360000 (full: 1.36 0.64 0.49
> > 1/475 14633)
> > Jan 26 01:41:18 [13240] destination-standby       crmd:    debug:
> > throttle_io_load:     Current IO load is 0.000000
> > Jan 26 01:41:28 [13191] destination-standby corosync debug   [TOTEM ]
> > The consensus timeout expired.
> > Jan 26 01:41:28 [13191] destination-standby corosync debug   [TOTEM ]
> > entering GATHER state from 3(The consensus timeout expired.).
> > Jan 26 01:41:28 [13191] destination-standby corosync warning [MAIN  ]
> > Totem is unable to form a cluster because of an operating system or
> > network fault. The most common cause of this message is that the local
> > firewall is configured improperly.
> > Jan 26 01:41:30 [13191] destination-standby corosync warning [MAIN  ]
> > Totem is unable to form a cluster because of an operating system or
> > network fault. The most common cause of this message is that the local
> > firewall is configured improperly.
> >
> > Jan 26 02:10:42 [13191] destination-standby corosync warning [MAIN  ]
> > *Totem is unable to form a cluster because of an operating system or
> > network fault. The most common cause of this message is that the local
> > firewall is configured improperly.*
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > crm_client_new:       Connecting 0x560a4bef1f60 for uid=0 gid=0 pid=6916
> > id=5a414534-bc62-4544-a5d2-6deb772a6b49
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > handle_new_connection:        IPC credentials authenticated (13235-6916-13)
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > qb_ipcs_shm_connect:  connecting to client [6916]
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > cib_acl_enabled:      CIB ACL is disabled
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > qb_ipcs_dispatch_connection_request:  HUP conn (13235-6916-13)
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > qb_ipcs_disconnect:   qb_ipcs_disconnect(13235-6916-13) state:2
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > crm_client_destroy:   Destroying 0 events
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-cib_rw-response-13235-6916-13-header
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-cib_rw-event-13235-6916-13-header
> > Jan 26 02:10:42 [13235] destination-standby        cib:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-cib_rw-request-13235-6916-13-header
> > Jan 26 02:10:42 [13240] destination-standby       crmd:    debug:
> > crm_client_new:       Connecting 0x55ce3bb0e6f0 for uid=0 gid=0 pid=6919
> > id=3a72b777-daa0-4b0e-acc2-fc58a07f31a6
> > Jan 26 02:10:42 [13240] destination-standby       crmd:    debug:
> > handle_new_connection:        IPC credentials authenticated (13240-6919-13)
> > Jan 26 02:10:42 [13240] destination-standby       crmd:    debug:
> > qb_ipcs_shm_connect:  connecting to client [6919]
> > Jan 26 02:10:42 [13240] destination-standby       crmd:    debug:
> > qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
> > Jan 26 02:10:42 [13240] destination-standby       crmd:    debug:
> > qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
> > Jan 26 02:10:42 [13240] destination-standby       crmd:    debug:
> > qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
> > Jan 26 02:10:44 [13191] destination-standby corosync warning [MAIN  ]
> > Totem is unable to form a cluster because of an operating system or
> > network fault. The most common cause of this message is that the local
> > firewall is configured improperly.
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_ipcs_dispatch_connection_request:  HUP conn (13240-6125-14)
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_ipcs_disconnect:   qb_ipcs_disconnect(13240-6125-14) state:2
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > crm_client_destroy:   Destroying 0 events
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-crmd-response-13240-6125-14-header
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-crmd-event-13240-6125-14-header
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-crmd-request-13240-6125-14-header
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > crm_client_new:       Connecting 0x55ce3bb22960 for uid=0 gid=0 pid=6928
> > id=039044f3-e674-4afa-9857-47459d1f0d0a
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > handle_new_connection:        IPC credentials authenticated (13240-6928-14)
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_ipcs_shm_connect:  connecting to client [6928]
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_rb_open_2: shm size:131085; real_size:135168; rb->word_size:33792
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > crm_client_new:       Connecting 0x560a4bef1f60 for uid=0 gid=0 pid=6928
> > id=45726483-e7b8-4ed5-8388-c6e8578d3366
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > handle_new_connection:        IPC credentials authenticated (13235-6928-13)
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > qb_ipcs_shm_connect:  connecting to client [6928]
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > qb_rb_open_2: shm size:524301; real_size:528384; rb->word_size:132096
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > cib_acl_enabled:      CIB ACL is disabled
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > qb_ipcs_dispatch_connection_request:  HUP conn (13235-6928-13)
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > qb_ipcs_disconnect:   qb_ipcs_disconnect(13235-6928-13) state:2
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > crm_client_destroy:   Destroying 0 events
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-cib_rw-response-13235-6928-13-header
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-cib_rw-event-13235-6928-13-header
> > Jan 26 02:10:44 [13235] destination-standby        cib:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-cib_rw-request-13235-6928-13-header
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_ipcs_dispatch_connection_request:  HUP conn (13240-6928-14)
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_ipcs_disconnect:   qb_ipcs_disconnect(13240-6928-14) state:2
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > crm_client_destroy:   Destroying 0 events
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-crmd-response-13240-6928-14-header
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-crmd-event-13240-6928-14-header
> > Jan 26 02:10:44 [13240] destination-standby       crmd:    debug:
> > qb_rb_close_helper:   Free'ing ringbuffer:
> > /dev/shm/qb-crmd-request-13240-6928-14-header
> > Jan 26 02:10:45 [13191] destination-standby corosync warning [MAIN  ]
> > Totem is unable to form a cluster because of an operating system or
> > network fault. The most common cause of this message is that the local
> > firewall is configured improperly.
> > Jan 26 02:10:47 [13191] destination-standby corosync warning [MAIN  ]
> > Totem is unable to form a cluster because of an operating system or
> > network fault. The most common cause of this message is that the local
> > firewall is configured improperly.
> > Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ]
> > The consensus timeout expired.
> > Jan 26 02:10:48 [13191] destination-standby corosync debug   [TOTEM ]
> > entering GATHER state from 3(The consensus timeout expired.).
> > Jan 26 02:10:48 [13191] destination-standby corosync warning [MAIN  ]
> > Totem is unable to form a cluster because of an operating system or
> > network fault. The most common cause of this message is that the local
> > firewall is configured improperly.
> > ============
> >
> > (target.standby)> cat /etc/corosync/corosync.conf
> > totem {
> >          token: 5000
> >          version: 2
> >          secauth: off
> >          threads: 0
> >          interface {
> >                  ringnumber: 0
> >                  mcastport: 5405
> >                  bindnetaddr: 10.27.77.204
> >          }
> >          transport: udpu
> > }
> >
> > nodelist {
> >          node {
> >              ring0_addr: 10.27.77.202
> >              name: target.sip
> >              nodeid: 3
> >
> >          }
> >          node {
> >              ring0_addr: 10.27.77.205
> >              name: target.sec.sip
> >              nodeid: 6
> >
> >          }
> >          node {
> >              ring0_addr: 10.27.77.106
> >              name: target.dsbc1
> >              nodeid: 7
> >
> >          }
> >          node {
> >              ring0_addr: 10.27.77.107
> >              name: target.dsbc
> >              nodeid: 8
> >
> >          }
> >          node {
> >              ring0_addr: 10.27.77.204
> >              name: target.standby
> >              nodeid: 5
> >
> >          }
> > }
> >
> > logging {
> >          fileline: off
> >          to_stderr: no
> >          to_logfile: yes
> >          to_syslog: no
> >          logfile: /var/10.27.77.204/log/cluster-suite/corosync.log
> >          syslog_facility: local0
> >          debug: on
> >          timestamp: on
> >          logger_subsys {
> >                  subsys: QUORUM
> >                  debug: on
> >          }
> > }
> >
> > quorum {
> >          provider: corosync_votequorum
> >          expected_votes: 5
> > }
> >
> > ===================
> >
> > (target.standby)> ping 10.27.77.106 (target.dsbc1)
> > PING 10.27.77.106 (10.27.77.106) 56(84) bytes of data.
> > 64 bytes from 10.27.77.106: icmp_seq=1 ttl=64 time=0.177 ms
> > 64 bytes from 10.27.77.106: icmp_seq=2 ttl=64 time=0.173 ms
> >
> > ===================
> >
> >  > sudo nmap -sU -p 5405 10.27.77.106
> >
> > Starting Nmap 6.40 ( http://nmap.org ) at 2021-01-26 02:24 UTC
> > Nmap scan report for dispatching-sbc (10.27.77.106)
> > Host is up (0.00029s latency).
> > PORT     STATE         SERVICE
> > 5405/udp open|filtered unknown
> >
> > =========
> >
> > My observation is that it can be fixed by stopping all nodes and starting
> > them one by one; restarting just the affected node may help as well.
> > Nodes see each other again once the Ring ID is equal on all nodes.
> >
> > Do you have any ideas?
> > Could you explain what the Ring ID means in terms of corosync quorum (RRP
> > is not configured)?
> > Why does it differ?
> > What additional info may I provide if reproduced again?
> >
> > It reproduces sporadically; sometimes the issue is combined with 100% CPU
> > usage by corosync, which also requires restarting pacemaker/corosync on the
> > affected node (and sometimes on all nodes). In this particular case, CPU
> > usage is normal.
> >
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/


More information about the Users mailing list