[ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

Prasad Nagaraj prasad.nagaraj76 at gmail.com
Wed Aug 22 00:59:59 UTC 2018


Thanks Ken and Ulrich. There is definitely high I/O on the system, with
iowait sometimes as high as 90%.
I have come across some previous posts suggesting that iowait also counts as
CPU load as far as corosync is concerned. Is this true? Can high I/O cause
corosync to complain with messages like "Corosync main process was not
scheduled for..." or "High CPU load detected.."?

I will surely monitor the system more.
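
For what it's worth, this is roughly how I am checking each node's log for
those warnings (the logfile path is the one from our corosync.conf quoted
below, and the patterns are just the messages mentioned above):

    grep -E "not scheduled for|High CPU load" /var/log/cluster/corosync.log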

Thanks for your help.
Prasad



On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot <kgaillot at redhat.com> wrote:

> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
> > >>> Prasad Nagaraj <prasad.nagaraj76 at gmail.com> wrote on 21.08.2018 at
> > 11:42 in message
> > <CAHbCUJ0zdvpYALCR7tbnGgb8qrZHh8uDjE+RsnkoewvmFb8wAg at mail.gmail.com>:
> > > Hi Ken - Thanks for your response.
> > >
> > > We have seen messages in other cases like:
> > >
> > > corosync [MAIN  ] Corosync main process was not scheduled for
> > > 17314.4746 ms (threshold is 8000.0000 ms). Consider token timeout
> > > increase.
> > > corosync [TOTEM ] A processor failed, forming new configuration.
> > >
> > > Is this an indication of a failure due to CPU load issues, and will
> > > this get resolved if I upgrade to the Corosync 2.x series?
>
> Yes, most definitely this is a CPU issue. It means corosync isn't
> getting enough CPU cycles to handle the cluster token before the
> timeout is reached.
>
> Upgrading may indeed help, as recent versions ensure that corosync runs
> with real-time priority in the kernel, and thus it is more likely to get
> CPU time when something of lower priority is consuming all the CPU.
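>
> As a quick sanity check, you can look at the scheduling policy of the
> running corosync process, e.g.:
>
>     chrt -p $(pidof corosync)    # needs util-linux's chrt
>
> On the 2.x series that should normally report a real-time policy such as
> SCHED_RR.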
>
> But of course, there is some underlying problem that should be
> identified and addressed. Figure out what's maxing out the CPU or I/O.
> Ulrich's monitoring suggestion is a good start.
>
> > Hi!
> >
> > I'd strongly recommend starting monitoring on your nodes, at least
> > until you know what's going on. The good old UNIX sa (sysstat
> > package) could be a starting point. I'd monitor CPU idle
> > specifically, then check for devices at 100% utilization, then look
> > for network bottlenecks...
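> >
> > For example (just a sketch; the sysstat tools must be installed and
> > the intervals are arbitrary), run on each node while the problem is
> > happening:
> >
> >     sar -u 5        # CPU usage every 5s; watch %idle and %iowait
> >     iostat -x 5     # per-device stats every 5s; watch %util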
> >
> > A new corosync release cannot fix those, most likely.
> >
> > Regards,
> > Ulrich
> >
> > >
> > > In any case, in the current scenario we did not see any
> > > scheduling-related messages.
> > >
> > > Thanks for your help.
> > > Prasad
> > >
> > > On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <kgaillot at redhat.com>
> > > wrote:
> > >
> > > > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
> > > > > Hi:
> > > > >
> > > > > One of these days, I saw a spurious node loss on my 3-node
> > > > > corosync cluster, with the following logged in the corosync.log
> > > > > of one of the nodes.
> > > > >
> > > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > > Transitional membership event on ring 32: memb=2, new=0, lost=1
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > > > > vm02d780875f 67114156
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
> > > > > vmfa2757171f 151000236
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
> > > > > vm728316982d 201331884
> > > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > > Stable
> > > > > membership event on ring 32: memb=2, new=0, lost=0
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > > > > vm02d780875f 67114156
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
> > > > > vmfa2757171f 151000236
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info:
> > > > > ais_mark_unseen_peer_dead:
> > > > > Node vm728316982d was not seen in the previous transition
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
> > > > > 201331884/vm728316982d is now: lost
> > > > > Aug 18 12:40:25 corosync [pcmk  ] info:
> > > > > send_member_notification:
> > > > > Sending membership update 32 to 3 children
> > > > > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
> > > > > the
> > > > > membership and a new membership was formed.
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:     info:
> > > > > plugin_handle_membership:     Membership 32: quorum retained
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > > > > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > > > > vm728316982d[201331884] - state is now lost (was member)
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > > > > plugin_handle_membership:     Membership 32: quorum retained
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
> > > > > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > > > > vm728316982d[201331884] - state is now lost (was member)
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > > > > peer_update_callback: vm728316982d is now lost (was member)
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:  warning:
> > > > > match_down_event:     No match for shutdown action on
> > > > > vm728316982d
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
> > > > > peer_update_callback: Stonith/shutdown of vm728316982d not
> > > > > matched
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > > > > crm_update_peer_join: peer_update_callback: Node
> > > > > vm728316982d[201331884] - join-6 phase 4 -> 0
> > > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
> > > > > abort_transition_graph:       Transition aborted: Node failure
> > > > > (source=peer_update_callback:240, 1)
> > > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:     info:
> > > > > plugin_handle_membership:     Membership 32: quorum retained
> > > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
> > > > > crm_update_peer_state_iter:   plugin_handle_membership: Node
> > > > > vm728316982d[201331884] - state is now lost (was member)
> > > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
> > > > > crm_reap_dead_member: Removing vm728316982d/201331884 from the
> > > > > membership list
> > > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
> > > > > reap_crm_member:      Purged 1 peers with id=201331884 and/or
> > > > > uname=vm728316982d from the membership cache
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > > > > crm_reap_dead_member: Removing vm728316982d/201331884 from the
> > > > > membership list
> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
> > > > > reap_crm_member:      Purged 1 peers with id=201331884 and/or
> > > > > uname=vm728316982d from the membership cache
> > > > >
> > > > > However, within seconds, the node was able to join back.
> > > > >
> > > > > Aug 18 12:40:34 corosync [pcmk  ] notice: pcmk_peer_update:
> > > > > Stable
> > > > > membership event on ring 36: memb=3, new=1, lost=0
> > > > > Aug 18 12:40:34 corosync [pcmk  ] info: update_member: Node
> > > > > 201331884/vm728316982d is now: member
> > > > > Aug 18 12:40:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:
> > > > > vm728316982d 201331884
> > > > >
> > > > >
> > > > > But this was enough time for the cluster to get into a split-brain
> > > > > kind of situation, with a resource on the node vm728316982d being
> > > > > stopped because of this node-loss detection.
> > > > >
> > > > > Could anyone advise whether this could happen due to a transient
> > > > > network disruption or the like?
> > > > > Are there any configuration settings that can be applied in
> > > > > corosync.conf so that the cluster is more resilient to such
> > > > > temporary disruptions?
> > > >
> > > > Your corosync timings of a 10-second token timeout and 10
> > > > retransmissions are already very generous -- the node was likely
> > > > already unresponsive for more than 10 seconds before the first
> > > > message above, so it was more than 18 seconds before it rejoined.
> > > >
> > > > It's rarely a good idea to change
> > > > token_retransmits_before_loss_const; changing token is generally
> > > > enough to deal with transient network unreliability. However, 18
> > > > seconds is a really long time to raise the token to, and it's
> > > > uncertain from the information here whether the root cause was
> > > > networking or something on the host.
> > > >
> > > > I notice your configuration is corosync 1 with the pacemaker plugin;
> > > > that is a long-deprecated setup, and corosync 3 is about to come out,
> > > > so you may want to consider upgrading to at least corosync 2 and a
> > > > reasonably recent pacemaker. That would give you some reliability
> > > > improvements, including real-time priority scheduling of corosync,
> > > > which could have been the issue here if CPU load rather than
> > > > networking was the root cause.
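> > > >
> > > > As a rough sketch only (untested; the node IDs and cluster name below
> > > > are made up, the addresses are the ones from your posted config), the
> > > > corosync 2.x equivalent would drop the pacemaker plugin section and
> > > > look something like:
> > > >
> > > >     totem {
> > > >         version: 2
> > > >         cluster_name: mycluster
> > > >         secauth: on
> > > >         transport: udpu
> > > >         token: 10000
> > > >     }
> > > >
> > > >     nodelist {
> > > >         node {
> > > >             ring0_addr: 172.20.0.4
> > > >             nodeid: 1
> > > >         }
> > > >         node {
> > > >             ring0_addr: 172.20.0.9
> > > >             nodeid: 2
> > > >         }
> > > >         node {
> > > >             ring0_addr: 172.20.0.12
> > > >             nodeid: 3
> > > >         }
> > > >     }
> > > >
> > > >     quorum {
> > > >         provider: corosync_votequorum
> > > >     }
> > > >
> > > > plus your existing logging section, with pacemaker then run as its
> > > > own service rather than through the plugin.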
> > > >
> > > > >
> > > > > Currently my corosync.conf looks like this:
> > > > >
> > > > > compatibility: whitetank
> > > > >
> > > > > totem {
> > > > >     version: 2
> > > > >     secauth: on
> > > > >     threads: 0
> > > > >     interface {
> > > > >         member {
> > > > >             memberaddr: 172.20.0.4
> > > > >         }
> > > > >         member {
> > > > >             memberaddr: 172.20.0.9
> > > > >         }
> > > > >         member {
> > > > >             memberaddr: 172.20.0.12
> > > > >         }
> > > > >         bindnetaddr: 172.20.0.12
> > > > >         ringnumber: 0
> > > > >         mcastport: 5405
> > > > >         ttl: 1
> > > > >     }
> > > > >     transport: udpu
> > > > >     token: 10000
> > > > >     token_retransmits_before_loss_const: 10
> > > > > }
> > > > >
> > > > > logging {
> > > > >     fileline: off
> > > > >     to_stderr: yes
> > > > >     to_logfile: yes
> > > > >     to_syslog: no
> > > > >     logfile: /var/log/cluster/corosync.log
> > > > >     timestamp: on
> > > > >     logger_subsys {
> > > > >         subsys: AMF
> > > > >         debug: off
> > > > >     }
> > > > > }
> > > > >
> > > > > service {
> > > > >     name: pacemaker
> > > > >     ver: 1
> > > > > }
> > > > >
> > > > > amf {
> > > > >     mode: disabled
> > > > > }
> > > > >
> > > > > Thanks in advance for the help.
> > > > > Prasad
> > > > >
> > > >
> > > > --
> > > > Ken Gaillot <kgaillot at redhat.com>
> > > >
> >
> >
> >
> --
> Ken Gaillot <kgaillot at redhat.com>

