[ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

Ken Gaillot kgaillot at redhat.com
Tue Aug 21 15:37:52 UTC 2018


On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
> > > > Prasad Nagaraj <prasad.nagaraj76 at gmail.com> wrote on
> > > > 21.08.2018 at 11:42 in message
> > > > <CAHbCUJ0zdvpYALCR7tbnGgb8qrZHh8uDjE+RsnkoewvmFb8wAg at mail.gmail.com>:
> > Hi Ken - Thanks for your response.
> > 
> > We have seen messages in other cases, like:
> > corosync [MAIN  ] Corosync main process was not scheduled for
> > 17314.4746 ms
> > (threshold is 8000.0000 ms). Consider token timeout increase.
> > corosync [TOTEM ] A processor failed, forming new configuration.
> > 
> > Is this an indication of a failure due to CPU load issues, and will
> > this be resolved if I upgrade to the Corosync 2.x series?

Yes, most definitely this is a CPU issue. It means corosync isn't
getting enough CPU cycles to handle the cluster token before the
timeout is reached.
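(As a side note: the 8000 ms threshold in that "not scheduled" message appears to be derived from the token timeout -- 80% of the 10000 ms token in your configuration. If tuning ever becomes necessary, it is a one-line change in the totem section of corosync.conf; the fragment below simply restates your current values, not a recommendation:)

```
totem {
    # Token timeout in milliseconds; the scheduling-pause warning
    # threshold tracks this value (apparently at 80% of it).
    token: 10000
    token_retransmits_before_loss_const: 10
}
```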

Upgrading may indeed help, as recent versions ensure that corosync runs
with real-time priority in the kernel, and thus are more likely to get
CPU time when something of lower priority is consuming all the CPU.
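(A quick way to verify this on a running node -- a sketch only, assuming util-linux's chrt is available; corosync 2.x requests the SCHED_RR real-time class by default:)

```shell
# Report corosync's scheduling policy, if it is running.
# Corosync 2.x normally shows SCHED_RR (real-time round-robin).
pid=$(pgrep -x corosync | head -n 1)
if [ -n "$pid" ]; then
    chrt -p "$pid"
else
    echo "corosync not running on this host"
fi
```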

But of course, there is some underlying problem that should be
identified and addressed. Figure out what's maxing out the CPU or I/O.
Ulrich's monitoring suggestion is a good start.
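(A minimal, dependency-free way to take that first measurement -- a sketch that samples overall CPU idle from /proc/stat on Linux; sar from the sysstat package gives you the same data with history:)

```shell
# Sample overall CPU idle% over one second from /proc/stat (Linux).
# First line fields: cpu user nice system idle iowait irq softirq ...
read -r _ u1 n1 s1 i1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 _ < /proc/stat
total=$(( (u2 + n2 + s2 + i2) - (u1 + n1 + s1 + i1) ))
idle_pct=$(( 100 * (i2 - i1) / total ))
echo "cpu idle: ${idle_pct}%"
```

A sustained idle figure near 0% while corosync logs pause warnings points at CPU starvation rather than the network.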

> Hi!
> 
> I'd strongly recommend starting monitoring on your nodes, at least
> until you know what's going on. The good old UNIX sa (sysstat
> package) could be a starting point. I'd monitor CPU idle
> specifically. Then check for 100% device utilization, then look for
> network bottlenecks...
> 
> A new corosync release most likely cannot fix those.
> 
> Regards,
> Ulrich
> 
> > 
> > In any case, for the current scenario, we did not see any
> > scheduling-related messages.
> > 
> > Thanks for your help.
> > Prasad
> > 
> > On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <kgaillot at redhat.com>
> > wrote:
> > 
> > > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> > > > Hi:
> > > > 
> > > > One of these days, I saw a spurious node loss on my 3-node
> > > > corosync
> > > > cluster with following logged in the corosync.log of one of the
> > > > nodes.
> > > > 
> > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > Transitional membership event on ring 32: memb=2, new=0, lost=1
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > > > vm02d780875f 67114156
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > > > vmfa2757171f 151000236
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
> > > > vm728316982d 201331884
> > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > Stable
> > > > membership event on ring 32: memb=2, new=0, lost=0
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > > > vm02d780875f 67114156
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > > > vmfa2757171f 151000236
> > > > Aug 18 12:40:25 corosync [pcmk  ] info:
> > > > ais_mark_unseen_peer_dead:
> > > > Node vm728316982d was not seen in the previous transition
> > > > Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
> > > > 201331884/vm728316982d is now: lost
> > > > Aug 18 12:40:25 corosync [pcmk  ] info:
> > > > send_member_notification:
> > > > Sending membership update 32 to 3 children
> > > > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
> > > > the
> > > > membership and a new membership was formed.
> > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:     info:
> > > > plugin_handle_membership:     Membership 32: quorum retained
> > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > > > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > > > vm728316982d[201331884] - state is now lost (was member)
> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > > > plugin_handle_membership:     Membership 32: quorum retained
> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
> > > > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > > > vm728316982d[201331884] - state is now lost (was member)
> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > > > peer_update_callback: vm728316982d is now lost (was member)
> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:  warning:
> > > > match_down_event:     No match for shutdown action on
> > > > vm728316982d
> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
> > > > peer_update_callback: Stonith/shutdown of vm728316982d not
> > > > matched
> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > > > crm_update_peer_join: peer_update_callback: Node
> > > > vm728316982d[201331884] - join-6 phase 4 -> 0
> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > > > abort_transition_graph:       Transition aborted: Node failure
> > > > (source=peer_update_callback:240, 1)
> > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:     info:
> > > > plugin_handle_membership:     Membership 32: quorum retained
> > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
> > > > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > > > vm728316982d[201331884] - state is now lost (was member)
> > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
> > > > crm_reap_dead_member: Removing vm728316982d/201331884 from the
> > > > membership list
> > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
> > > > reap_crm_member:      Purged 1 peers with id=201331884 and/or
> > > > uname=vm728316982d from the membership cache
> > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > > > crm_reap_dead_member: Removing vm728316982d/201331884 from the
> > > > membership list
> > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > > > reap_crm_member:      Purged 1 peers with id=201331884 and/or
> > > > uname=vm728316982d from the membership cache
> > > > 
> > > > However, within seconds, the node was able to join back.
> > > > 
> > > > Aug 18 12:40:34 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > Stable
> > > > membership event on ring 36: memb=3, new=1, lost=0
> > > > Aug 18 12:40:34 corosync [pcmk  ] info: update_member: Node
> > > > 201331884/vm728316982d is now: member
> > > > Aug 18 12:40:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:
> > > > vm728316982d 201331884
> > > > 
> > > > 
> > > > But this was enough time for the cluster to get into a
> > > > split-brain kind of situation, with a resource on node
> > > > vm728316982d being stopped because of this node-loss detection.
> > > > 
> > > > Could anyone tell me whether this could have happened due to a
> > > > transient network disruption or the like?
> > > > Are there any configuration settings that can be applied in
> > > > corosync.conf so that the cluster is more resilient to such
> > > > temporary disruptions?
> > > 
> > > Your corosync settings of a 10-second token timeout and 10
> > > retransmissions are already very lenient -- likely the node was
> > > already unresponsive for more than 10 seconds before the first
> > > message above, so it was more than 18 seconds before it rejoined.
> > > 
> > > It's rarely a good idea to change
> > > token_retransmits_before_loss_const;
> > > changing token is generally enough to deal with transient network
> > > unreliability. However, 18 seconds is a really long time to raise
> > > the
> > > token to, and it's uncertain from the information here whether
> > > the root
> > > cause was networking or something on the host.
> > > 
> > > I notice your configuration is corosync 1 with the pacemaker
> > > plugin;
> > > that is a long-deprecated setup, and corosync 3 is about to come
> > > out,
> > > so you may want to consider upgrading to at least corosync 2 and
> > > a
> > > reasonably recent pacemaker. That would give you some reliability
> > > improvements, including real-time priority scheduling of
> > > corosync,
> > > which could have been the issue here if CPU load rather than
> > > networking
> > > was the root cause.
> > > 
> > > > 
> > > > Currently my corosync.conf looks like this :
> > > > 
> > > > compatibility: whitetank
> > > > 
> > > > totem {
> > > >     version: 2
> > > >     secauth: on
> > > >     threads: 0
> > > >     interface {
> > > >         member {
> > > >             memberaddr: 172.20.0.4
> > > >         }
> > > >         member {
> > > >             memberaddr: 172.20.0.9
> > > >         }
> > > >         member {
> > > >             memberaddr: 172.20.0.12
> > > >         }
> > > >         bindnetaddr: 172.20.0.12
> > > >         ringnumber: 0
> > > >         mcastport: 5405
> > > >         ttl: 1
> > > >     }
> > > >     transport: udpu
> > > >     token: 10000
> > > >     token_retransmits_before_loss_const: 10
> > > > }
> > > > 
> > > > logging {
> > > >     fileline: off
> > > >     to_stderr: yes
> > > >     to_logfile: yes
> > > >     to_syslog: no
> > > >     logfile: /var/log/cluster/corosync.log
> > > >     timestamp: on
> > > >     logger_subsys {
> > > >         subsys: AMF
> > > >         debug: off
> > > >     }
> > > > }
> > > > 
> > > > service {
> > > >     name: pacemaker
> > > >     ver: 1
> > > > }
> > > > 
> > > > amf {
> > > >     mode: disabled
> > > > }
> > > > 
> > > > Thanks in advance for the help.
> > > > Prasad
> > > > 
> > > > _______________________________________________
> > > > Users mailing list: Users at clusterlabs.org 
> > > > https://lists.clusterlabs.org/mailman/listinfo/users 
> > > > 
> > > > Project Home: http://www.clusterlabs.org 
> > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > Bugs: http://bugs.clusterlabs.org 
> > > 
> > > --
> > > Ken Gaillot <kgaillot at redhat.com>
> > > 
> 
> 
> 
-- 
Ken Gaillot <kgaillot at redhat.com>

