<div dir="ltr">Thanks Ken and Ulrich. There is definitely high I/O on the system, with IOWAIT sometimes reaching up to 90%.<div>I have come across some previous posts saying that IOWAIT is also counted as CPU load by corosync. Is this true? Can high I/O lead corosync to complain with messages such as "


<span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Corosync main process was not scheduled for..." or "High CPU load detected.." ?</span><br></div><div><span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">I will surely monitor the system more.</span></div><div><span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Thanks for your help.</span></div><div><span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Prasad</span></div><div><span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><span style="font-size:small;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot <span dir="ltr"><<a href="mailto:kgaillot@redhat.com" target="_blank">kgaillot@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:<br>

> > > > Prasad Nagaraj <<a href="mailto:prasad.nagaraj76@gmail.com">prasad.nagaraj76@gmail.com</a>> wrote on<br>

> > > > 21.08.2018 at 11:42 in<br>

> <br>

> message<br>

> <<a href="mailto:CAHbCUJ0zdvpYALCR7tbnGgb8qrZHh8uDjE%2BRsnkoewvmFb8wAg@mail.gmail.com">CAHbCUJ0zdvpYALCR7tbnGgb8qrZH<wbr>h8uDjE+RsnkoewvmFb8wAg@mail.<wbr>gmail.com</a>>:<br>

> > Hi Ken - Thanks for your response.<br>

> > <br>

> > We have seen messages like this in other cases:<br>

> > corosync [MAIN  ] Corosync main process was not scheduled for<br>

> > 17314.4746 ms<br>

> > (threshold is 8000.0000 ms). Consider token timeout increase.<br>

> > corosync [TOTEM ] A processor failed, forming new configuration.<br>

> > <br>

> > Does this indicate a failure caused by CPU load, and will it be<br>

> > resolved if I upgrade to the Corosync 2.x series?<br>

<br>

</span>Yes, most definitely this is a CPU issue. It means corosync isn't<br>

getting enough CPU cycles to handle the cluster token before the<br>

timeout is reached.<br>
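<br>
To see how often and for how long this happens, the warning can be pulled out of corosync.log. The following is only a minimal Python sketch: the function name is made up for illustration, and the pattern is based solely on the message text quoted above, so it may need adjusting for other corosync versions.<br>

```python
import re

# Pattern for the scheduling warning quoted in this thread.
PAUSE_RE = re.compile(
    r"Corosync main process was not scheduled for\s+([\d.]+)\s*ms\s*"
    r"\(threshold is\s+([\d.]+)\s*ms\)"
)


def find_scheduling_pauses(log_text):
    """Return a list of (pause_ms, threshold_ms) pairs found in log_text."""
    # Mail clients and loggers may wrap lines, so collapse whitespace first.
    flat = " ".join(log_text.split())
    return [(float(p), float(t)) for p, t in PAUSE_RE.findall(flat)]


# The message from this thread, wrapped the way it was quoted.
sample = """corosync [MAIN  ] Corosync main process was not scheduled for
17314.4746 ms
(threshold is 8000.0000 ms). Consider token timeout increase."""

if __name__ == "__main__":
    for pause_ms, threshold_ms in find_scheduling_pauses(sample):
        print(f"paused {pause_ms} ms, threshold {threshold_ms} ms")
```

Feeding it the whole log file instead of the sample would show whether the pauses line up with the times of the node losses.<br>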

<br>

Upgrading may indeed help, as recent versions ensure that corosync runs<br>

with real-time priority in the kernel, and thus are more likely to get<br>

CPU time when something of lower priority is consuming all the CPU.<br>
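<br>
Whether a running corosync actually has a real-time policy can be checked by asking the kernel. A minimal Python sketch for Linux, using the standard os.sched_getscheduler call; the helper names are hypothetical and the corosync PID has to be supplied by whoever runs it:<br>

```python
import os

# Real-time policies on Linux; every other policy is ordinary time-sharing.
REALTIME_POLICIES = {os.SCHED_FIFO, os.SCHED_RR}

# The kernel may OR this flag into the reported policy; mask it out if known.
RESET_ON_FORK = getattr(os, "SCHED_RESET_ON_FORK", 0)


def is_realtime(policy):
    """True if the given scheduling policy constant is a real-time policy."""
    return (policy & ~RESET_ON_FORK) in REALTIME_POLICIES


def process_is_realtime(pid):
    """Ask the kernel which policy the process with this PID runs under."""
    return is_realtime(os.sched_getscheduler(pid))


if __name__ == "__main__":
    # PID 0 means "the calling process"; substitute the PID of corosync.
    print(process_is_realtime(0))
```

If this reports False for corosync, the process competes with everything else on the host as an ordinary task.<br>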

<br>

But of course, there is some underlying problem that should be<br>

identified and addressed. Figure out what's maxing out the CPU or I/O.<br>

Ulrich's monitoring suggestion is a good start.<br>
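<br>
If installing sysstat is not an option right away, idle and iowait percentages can also be sampled directly from /proc/stat. This is a rough Python sketch, not a replacement for sa; the function names and the 5-second interval are arbitrary choices for illustration.<br>

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ticks."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep up to 8 counters: user nice system idle iowait irq softirq steal.
    # Guest time is already included in user/nice, so it is left out.
    return [int(value) for value in fields[1:9]]


def idle_and_iowait_percent(before, after):
    """Return (idle %, iowait %) for the interval between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return 0.0, 0.0
    # Index 3 is idle and index 4 is iowait in the /proc/stat field order.
    return 100.0 * deltas[3] / total, 100.0 * deltas[4] / total


def read_cpu_sample(path="/proc/stat"):
    """Read the current aggregate CPU counters."""
    with open(path) as stat:
        return parse_cpu_line(stat.readline())


if __name__ == "__main__":
    first = read_cpu_sample()
    time.sleep(5)
    idle, iowait = idle_and_iowait_percent(first, read_cpu_sample())
    print(f"idle {idle:.1f}%  iowait {iowait:.1f}%")
```

Running it in a loop and logging the output with timestamps makes it possible to compare the numbers with the times of the corosync warnings.<br>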

<div class="HOEnZb"><div class="h5"><br>

> Hi!<br>

> <br>

> I'd strongly recommend starting monitoring on your nodes, at least<br>

> until you know what's going on. The good old UNIX sa (sysstat<br>

> package) could be a starting point. I'd monitor CPU idle<br>

> specifically. Then go for 100% device utilization, then look for<br>

> network bottlenecks...<br>

> <br>

> A new corosync release cannot fix those, most likely.<br>

> <br>

> Regards,<br>

> Ulrich<br>

> <br>

> > <br>

> > In any case, for the current scenario, we did not see any<br>

> > scheduling-related messages.<br>

> > <br>

> > Thanks for your help.<br>

> > Prasad<br>

> > <br>

> > On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>><br>

> > wrote:<br>

> > <br>

> > > On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:<br>

> > > > Hi:<br>

> > > > <br>

> > > > Recently, I saw a spurious node loss on my 3-node corosync<br>

> > > > cluster, with the following logged in the corosync.log of one of the<br>

> > > > nodes.<br>

> > > > <br>

> > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:<br>

> > > > Transitional membership event on ring 32: memb=2, new=0, lost=1<br>

> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:<br>

> > > > vm02d780875f 67114156<br>

> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:<br>

> > > > vmfa2757171f 151000236<br>

> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:<br>

> > > > vm728316982d 201331884<br>

> > > > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:<br>

> > > > Stable<br>

> > > > membership event on ring 32: memb=2, new=0, lost=0<br>

> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:<br>

> > > > vm02d780875f 67114156<br>

> > > > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:<br>

> > > > vmfa2757171f 151000236<br>

> > > > Aug 18 12:40:25 corosync [pcmk  ] info:<br>

> > > > ais_mark_unseen_peer_dead:<br>

> > > > Node vm728316982d was not seen in the previous transition<br>

> > > > Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node<br>

> > > > 201331884/vm728316982d is now: lost<br>

> > > > Aug 18 12:40:25 corosync [pcmk  ] info:<br>

> > > > send_member_notification:<br>

> > > > Sending membership update 32 to 3 children<br>

> > > > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left<br>

> > > > the<br>

> > > > membership and a new membership was formed.<br>

> > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:     info:<br>

> > > > plugin_handle_membership:     <wbr>Membership 32: quorum retained<br>

> > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:<br>

> > > > crm_update_peer_state_iter:   <wbr>plugin_handle_membership: Node<br>

> > > > vm728316982d[201331884] - state is now lost (was member)<br>

> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     <wbr>info:<br>

> > > > plugin_handle_membership:     <wbr>Membership 32: quorum retained<br>

> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   <wbr>notice:<br>

> > > > crm_update_peer_state_iter:   <wbr>plugin_handle_membership: Node<br>

> > > > vm728316982d[201331884] - state is now lost (was member)<br>

> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     <wbr>info:<br>

> > > > peer_update_callback: vm728316982d is now lost (was member)<br>

> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:  <wbr>warning:<br>

> > > > match_down_event:     No match for shutdown action on<br>

> > > > vm728316982d<br>

> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   <wbr>notice:<br>

> > > > peer_update_callback: Stonith/shutdown of vm728316982d not<br>

> > > > matched<br>

> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     <wbr>info:<br>

> > > > crm_update_peer_join: peer_update_callback: Node<br>

> > > > vm728316982d[201331884] - join-6 phase 4 -> 0<br>

> > > > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     <wbr>info:<br>

> > > > abort_transition_graph:       <wbr>Transition aborted: Node failure<br>

> > > > (source=peer_update_callback:<wbr>240, 1)<br>

> > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:     <wbr>info:<br>

> > > > plugin_handle_membership:     <wbr>Membership 32: quorum retained<br>

> > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   <wbr>notice:<br>

> > > > crm_update_peer_state_iter:   <wbr>plugin_handle_membership: Node<br>

> > > > vm728316982d[201331884] - state is now lost (was member)<br>

> > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   <wbr>notice:<br>

> > > > crm_reap_dead_member: Removing vm728316982d/201331884 from the<br>

> > > > membership list<br>

> > > > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   <wbr>notice:<br>

> > > > reap_crm_member:      Purged 1 peers with id=201331884 and/or<br>

> > > > uname=vm728316982d from the membership cache<br>

> > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:<br>

> > > > crm_reap_dead_member: Removing vm728316982d/201331884 from the<br>

> > > > membership list<br>

> > > > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:<br>

> > > > reap_crm_member:      Purged 1 peers with id=201331884 and/or<br>

> > > > uname=vm728316982d from the membership cache<br>

> > > > <br>

> > > > However, within seconds, the node was able to join back.<br>

> > > > <br>

> > > > Aug 18 12:40:34 corosync [pcmk  ] notice: pcmk_peer_update:<br>

> > > > Stable<br>

> > > > membership event on ring 36: memb=3, new=1, lost=0<br>

> > > > Aug 18 12:40:34 corosync [pcmk  ] info: update_member: Node<br>

> > > > 201331884/vm728316982d is now: member<br>

> > > > Aug 18 12:40:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:<br>

> > > > vm728316982d 201331884<br>

> > > > <br>

> > > > <br>

> > > > But this was enough time for the cluster to get into a<br>

> > > > split-brain-like situation, with a resource on the node<br>

> > > > vm728316982d being stopped because of this node loss detection.<br>

> > > > <br>

> > > > Could anyone help me understand whether this could be caused by<br>

> > > > a transient network disruption?<br>

> > > > Are there any configuration settings that can be applied in<br>

> > > > corosync.conf to make the cluster more resilient to such temporary<br>

> > > > disruptions?<br>

> > > <br>

> > > Your corosync sensitivity of 10-second token timeout and 10<br>

> > > retransmissions is already very lengthy -- likely the node was<br>

> > > already<br>

> > > unresponsive for more than 10 seconds before the first message<br>

> > > above,<br>

> > > so it was more than 18 seconds before it rejoined.<br>

> > > <br>

> > > It's rarely a good idea to change<br>

> > > token_retransmits_before_loss_<wbr>const;<br>

> > > changing token is generally enough to deal with transient network<br>

> > > unreliability. However 18 seconds is a really long time to raise<br>

> > > the<br>

> > > token to, and it's uncertain from the information here whether<br>

> > > the root<br>

> > > cause was networking or something on the host.<br>

> > > <br>

> > > I notice your configuration is corosync 1 with the pacemaker<br>

> > > plugin;<br>

> > > that is a long-deprecated setup, and corosync 3 is about to come<br>

> > > out,<br>

> > > so you may want to consider upgrading to at least corosync 2 and<br>

> > > a<br>

> > > reasonably recent pacemaker. That would give you some reliability<br>

> > > improvements, including real-time priority scheduling of<br>

> > > corosync,<br>

> > > which could have been the issue here if CPU load rather than<br>

> > > networking<br>

> > > was the root cause.<br>

> > > <br>

> > > > <br>

> > > > Currently my corosync.conf looks like this :<br>

> > > > <br>

> > > > compatibility: whitetank<br>

> > > > totem {<br>

> > > >     version: 2<br>

> > > >     secauth: on<br>

> > > >     threads: 0<br>

> > > >     interface {<br>

> > > >     member {<br>

> > > >             memberaddr: 172.20.0.4<br>

> > > >         }<br>

> > > > member {<br>

> > > >             memberaddr: 172.20.0.9<br>

> > > >         }<br>

> > > > member {<br>

> > > >             memberaddr: 172.20.0.12<br>

> > > >         }<br>

> > > > <br>

> > > >     bindnetaddr: 172.20.0.12<br>

> > > > <br>

> > > >     ringnumber: 0<br>

> > > >     mcastport: 5405<br>

> > > >     ttl: 1<br>

> > > >     }<br>

> > > >     transport: udpu<br>

> > > >     token: 10000<br>

> > > >     token_retransmits_before_<wbr>loss_const: 10<br>

> > > > }<br>

> > > > <br>

> > > > logging {<br>

> > > >     fileline: off<br>

> > > >     to_stderr: yes<br>

> > > >     to_logfile: yes<br>

> > > >     to_syslog: no<br>

> > > >     logfile: /var/log/cluster/corosync.log<br>

> > > >     timestamp: on<br>

> > > >     logger_subsys {<br>

> > > >     subsys: AMF<br>

> > > >     debug: off<br>

> > > >     }<br>

> > > > }<br>

> > > > service {<br>

> > > >     name: pacemaker<br>

> > > >     ver: 1<br>

> > > > }<br>

> > > > amf {<br>

> > > >     mode: disabled<br>

> > > > }<br>

> > > > <br>

> > > > Thanks in advance for the help.<br>

> > > > Prasad<br>

> > > > <br>

> > > > ______________________________<wbr>_________________<br>

> > > > Users mailing list: <a href="mailto:Users@clusterlabs.org">Users@clusterlabs.org</a> <br>

> > > > <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/<wbr>mailman/listinfo/users</a> <br>

> > > > <br>

> > > > Project Home: <a href="http://www.clusterlabs.org" rel="noreferrer" target="_blank">http://www.clusterlabs.org</a> <br>

> > > > Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Sc" rel="noreferrer" target="_blank">http://www.clusterlabs.org/<wbr>doc/Cluster_from_Sc</a><br>

> > > > ratch.<br>

> > > > pdf<br>

> > > > Bugs: <a href="http://bugs.clusterlabs.org" rel="noreferrer" target="_blank">http://bugs.clusterlabs.org</a> <br>

> > > <br>

> > > --<br>

> > > Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>><br>


> > > <br>

> <br>

> <br>

> <br>


-- <br>

Ken Gaillot <<a href="mailto:kgaillot@redhat.com">kgaillot@redhat.com</a>><br>


</div></div></blockquote></div><br></div></div>