[ClusterLabs] Re: Spurious node loss in corosync cluster
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Aug 23 07:19:34 EDT 2018
>>> Prasad Nagaraj <prasad.nagaraj76 at gmail.com> wrote on 22.08.2018 at 02:59 in
message
<CAHbCUJ2ehJg13=V8SmKCg4eC8nNArzjfs8si+LtFJTryBAzf8w at mail.gmail.com>:
> Thanks Ken and Ulrich. There is definitely high IO on the system, with
> IOWAITs sometimes of up to 90%.
> I have come across some previous posts saying that IOWAIT is also considered
> CPU load by Corosync. Is this true? Does high IO lead corosync to
It's not Corosync, it's Linux: a process busy with I/O also adds to the (CPU) load. One typical effect we see is that when some stale NFS or CIFS share is still being used, the "CPU load" goes up...
> complain as in " Corosync main process was not scheduled for..." or "High
> CPU load detected.." ?
>
> I will surely monitor the system more.
As recommended, try sar's disk activity report (the numbers below are from a fast disk system, BTW):
00:00:01 DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
08:40:01 dev253-13 1225.65 18393.28 1469.77 16.21 0.53 0.44 0.33 40.76
08:50:01 dev253-51 4972.41 38796.90 977.88 8.00 2.26 0.46 0.11 55.19
09:10:01 dev253-51 4709.03 36692.07 975.01 8.00 2.73 0.58 0.14 64.57
09:20:01 dev253-51 4445.17 34708.88 847.96 8.00 1.70 0.38 0.12 55.03
10:10:01 dev253-51 4246.66 32944.55 1023.61 8.00 3.12 0.73 0.18 77.83
11:00:01 dev253-51 5500.39 42984.68 1012.82 8.00 4.55 0.83 0.14 76.91
19:50:01 dev253-51 49618.88 396396.53 547.83 8.00 139.60 2.81 0.01 60.98
%util is the column to look at; you could also look at await, but note that network-related I/O is not included there.
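In case it helps, those columns come from sar's per-device report; an invocation along these lines should reproduce them (assuming the sysstat package is installed and its data collection is enabled; the daily file path varies by distribution):

  # live sampling: one line per block device every 10 seconds, 6 samples
  sar -d -p 10 6
  # or read back today's collected data from the daily sa file
  sar -d -p -f /var/log/sa/sa$(date +%d)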
Regards,
Ulrich
>
> Thanks for your help.
> Prasad
>
>
>
> On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
>> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
>> > > > > Prasad Nagaraj <prasad.nagaraj76 at gmail.com> wrote on
>> > > > > 21.08.2018 at 11:42 in
>> >
>> > message
>> > <CAHbCUJ0zdvpYALCR7tbnGgb8qrZHh8uDjE+RsnkoewvmFb8wAg at mail.gmail.com>:
>> > > Hi Ken - Thanks for your response.
>> > >
>> > > We have seen messages in other cases like
>> > > corosync [MAIN ] Corosync main process was not scheduled for
>> > > 17314.4746 ms
>> > > (threshold is 8000.0000 ms). Consider token timeout increase.
>> > > corosync [TOTEM ] A processor failed, forming new configuration.
>> > >
>> > > Is this an indication of a failure due to CPU load issues, and will
>> > > this get resolved if I upgrade to the Corosync 2.x series?
>>
>> Yes, most definitely this is a CPU issue. It means corosync isn't
>> getting enough CPU cycles to handle the cluster token before the
>> timeout is reached.
>>
>> Upgrading may indeed help, as recent versions ensure that corosync runs
>> with real-time priority in the kernel, and thus are more likely to get
>> CPU time when something of lower priority is consuming all the CPU.
>>
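>> (A quick way to see what a given node is actually running with, assuming
>> util-linux's chrt and procps are available, is:
>>
>>   chrt -p $(pidof corosync)
>>   ps -o pid,cls,rtprio,comm -C corosync
>>
>> A scheduling class of RR or FF with a non-zero rtprio means corosync is
>> getting real-time priority.)
>>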
>> But of course, there is some underlying problem that should be
>> identified and addressed. Figure out what's maxing out the CPU or I/O.
>> Ulrich's monitoring suggestion is a good start.
>>
>> > Hi!
>> >
>> > I'd strongly recommend starting monitoring on your nodes, at least
>> > until you know what's going on. The good old UNIX sar (sysstat
>> > package) could be a starting point. I'd monitor CPU idle
>> > specifically, then check for 100% device utilization, then look for
>> > network bottlenecks...
>> >
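>> > For CPU idle and iowait, plain sar with an interval works; an
>> > illustrative invocation (again assuming sysstat is installed) would be:
>> >
>> >   sar -u 10 6
>> >
>> > and then watch the %idle and %iowait columns.
>> >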
>> > A new corosync release cannot fix those, most likely.
>> >
>> > Regards,
>> > Ulrich
>> >
>> > >
>> > > In any case, for the current scenario, we did not see any
>> > > scheduling
>> > > related messages.
>> > >
>> > > Thanks for your help.
>> > > Prasad
>> > >
>> > > On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <kgaillot at redhat.com>
>> > > wrote:
>> > >
>> > > > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
>> > > > > Hi:
>> > > > >
>> > > > > The other day, I saw a spurious node loss on my 3-node corosync
>> > > > > cluster, with the following logged in the corosync.log of one of
>> > > > > the nodes.
>> > > > >
>> > > > > Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update:
>> > > > > Transitional membership event on ring 32: memb=2, new=0, lost=1
>> > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb:
>> > > > > vm02d780875f 67114156
>> > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb:
>> > > > > vmfa2757171f 151000236
>> > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: lost:
>> > > > > vm728316982d 201331884
>> > > > > Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update:
>> > > > > Stable
>> > > > > membership event on ring 32: memb=2, new=0, lost=0
>> > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB:
>> > > > > vm02d780875f 67114156
>> > > > > Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB:
>> > > > > vmfa2757171f 151000236
>> > > > > Aug 18 12:40:25 corosync [pcmk ] info:
>> > > > > ais_mark_unseen_peer_dead:
>> > > > > Node vm728316982d was not seen in the previous transition
>> > > > > Aug 18 12:40:25 corosync [pcmk ] info: update_member: Node
>> > > > > 201331884/vm728316982d is now: lost
>> > > > > Aug 18 12:40:25 corosync [pcmk ] info:
>> > > > > send_member_notification:
>> > > > > Sending membership update 32 to 3 children
>> > > > > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
>> > > > > the
>> > > > > membership and a new membership was formed.
>> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info:
>> > > > > plugin_handle_membership: Membership 32: quorum retained
>> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice:
>> > > > > crm_update_peer_state_iter: plugin_handle_membership: Node
>> > > > > vm728316982d[201331884] - state is now lost (was member)
>> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info:
>> > > > > plugin_handle_membership: Membership 32: quorum retained
>> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice:
>> > > > > crm_update_peer_state_iter: plugin_handle_membership: Node
>> > > > > vm728316982d[201331884] - state is now lost (was member)
>> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info:
>> > > > > peer_update_callback: vm728316982d is now lost (was member)
>> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: warning:
>> > > > > match_down_event: No match for shutdown action on
>> > > > > vm728316982d
>> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice:
>> > > > > peer_update_callback: Stonith/shutdown of vm728316982d not
>> > > > > matched
>> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info:
>> > > > > crm_update_peer_join: peer_update_callback: Node
>> > > > > vm728316982d[201331884] - join-6 phase 4 -> 0
>> > > > > Aug 18 12:40:25 [4548] vmfa2757171f crmd: info:
>> > > > > abort_transition_graph: Transition aborted: Node failure
>> > > > > (source=peer_update_callback:240, 1)
>> > > > > Aug 18 12:40:25 [4543] vmfa2757171f cib: info:
>> > > > > plugin_handle_membership: Membership 32: quorum retained
>> > > > > Aug 18 12:40:25 [4543] vmfa2757171f cib: notice:
>> > > > > crm_update_peer_state_iter: plugin_handle_membership: Node
>> > > > > vm728316982d[201331884] - state is now lost (was member)
>> > > > > Aug 18 12:40:25 [4543] vmfa2757171f cib: notice:
>> > > > > crm_reap_dead_member: Removing vm728316982d/201331884 from the
>> > > > > membership list
>> > > > > Aug 18 12:40:25 [4543] vmfa2757171f cib: notice:
>> > > > > reap_crm_member: Purged 1 peers with id=201331884 and/or
>> > > > > uname=vm728316982d from the membership cache
>> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice:
>> > > > > crm_reap_dead_member: Removing vm728316982d/201331884 from the
>> > > > > membership list
>> > > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice:
>> > > > > reap_crm_member: Purged 1 peers with id=201331884 and/or
>> > > > > uname=vm728316982d from the membership cache
>> > > > >
>> > > > > However, within seconds, the node was able to join back.
>> > > > >
>> > > > > Aug 18 12:40:34 corosync [pcmk ] notice: pcmk_peer_update:
>> > > > > Stable
>> > > > > membership event on ring 36: memb=3, new=1, lost=0
>> > > > > Aug 18 12:40:34 corosync [pcmk ] info: update_member: Node
>> > > > > 201331884/vm728316982d is now: member
>> > > > > Aug 18 12:40:34 corosync [pcmk ] info: pcmk_peer_update: NEW:
>> > > > > vm728316982d 201331884
>> > > > >
>> > > > >
>> > > > > But this was enough time for the cluster to get into a split-brain
>> > > > > kind of situation, with a resource on the node vm728316982d being
>> > > > > stopped because of this node loss detection.
>> > > > >
>> > > > > Could anyone help determine whether this could happen due to a
>> > > > > transient network disruption or similar?
>> > > > > Are there any configuration settings that can be applied in
>> > > > > corosync.conf so that the cluster is more resilient to such
>> > > > > temporary disruptions?
>> > > >
>> > > > Your corosync sensitivity of 10-second token timeout and 10
>> > > > retransmissions is already very lengthy -- likely the node was
>> > > > already
>> > > > unresponsive for more than 10 seconds before the first message
>> > > > above,
>> > > > so it was more than 18 seconds before it rejoined.
>> > > >
>> > > > It's rarely a good idea to change
>> > > > token_retransmits_before_loss_const;
>> > > > changing token is generally enough to deal with transient network
>> > > > unreliability. However 18 seconds is a really long time to raise
>> > > > the
>> > > > token to, and it's uncertain from the information here whether
>> > > > the root
>> > > > cause was networking or something on the host.
>> > > >
>> > > > I notice your configuration is corosync 1 with the pacemaker
>> > > > plugin;
>> > > > that is a long-deprecated setup, and corosync 3 is about to come
>> > > > out,
>> > > > so you may want to consider upgrading to at least corosync 2 and
>> > > > a
>> > > > reasonably recent pacemaker. That would give you some reliability
>> > > > improvements, including real-time priority scheduling of
>> > > > corosync,
>> > > > which could have been the issue here if CPU load rather than
>> > > > networking
>> > > > was the root cause.
>> > > >
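>> > > > For what it's worth, with corosync 2.x the pacemaker service/plugin
>> > > > stanza goes away and the nodes and quorum are usually declared
>> > > > explicitly; a rough sketch (reusing the addresses from the config
>> > > > below, purely for illustration) would be:
>> > > >
>> > > >   nodelist {
>> > > >       node {
>> > > >           ring0_addr: 172.20.0.4
>> > > >       }
>> > > >       node {
>> > > >           ring0_addr: 172.20.0.9
>> > > >       }
>> > > >       node {
>> > > >           ring0_addr: 172.20.0.12
>> > > >       }
>> > > >   }
>> > > >   quorum {
>> > > >       provider: corosync_votequorum
>> > > >   }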
>> > > > >
>> > > > > Currently my corosync.conf looks like this :
>> > > > >
>> > > > > compatibility: whitetank
>> > > > > totem {
>> > > > > version: 2
>> > > > > secauth: on
>> > > > > threads: 0
>> > > > > interface {
>> > > > > member {
>> > > > > memberaddr: 172.20.0.4
>> > > > > }
>> > > > > member {
>> > > > > memberaddr: 172.20.0.9
>> > > > > }
>> > > > > member {
>> > > > > memberaddr: 172.20.0.12
>> > > > > }
>> > > > >
>> > > > > bindnetaddr: 172.20.0.12
>> > > > >
>> > > > > ringnumber: 0
>> > > > > mcastport: 5405
>> > > > > ttl: 1
>> > > > > }
>> > > > > transport: udpu
>> > > > > token: 10000
>> > > > > token_retransmits_before_loss_const: 10
>> > > > > }
>> > > > >
>> > > > > logging {
>> > > > > fileline: off
>> > > > > to_stderr: yes
>> > > > > to_logfile: yes
>> > > > > to_syslog: no
>> > > > > logfile: /var/log/cluster/corosync.log
>> > > > > timestamp: on
>> > > > > logger_subsys {
>> > > > > subsys: AMF
>> > > > > debug: off
>> > > > > }
>> > > > > }
>> > > > > service {
>> > > > > name: pacemaker
>> > > > > ver: 1
>> > > > > }
>> > > > > amf {
>> > > > > mode: disabled
>> > > > > }
>> > > > >
>> > > > > Thanks in advance for the help.
>> > > > > Prasad
>> > > > >
>> > > >
>> > > > --
>> > > > Ken Gaillot <kgaillot at redhat.com>
>> > > >
>> >
>> >
>> >
>> --
>> Ken Gaillot <kgaillot at redhat.com>