[ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

Jan Friesse jfriesse at redhat.com
Wed Aug 22 06:25:43 UTC 2018


Prasad,

> Thanks Ken and Ulrich. There is definitely high IO on the system, with
> IOWAIT sometimes as high as 90%.
> I have come across some previous posts saying that IOWAIT is also counted
> as CPU load by Corosync. Is this true? Can high IO lead corosync to
> complain as in "Corosync main process was not scheduled for..." or
> "Hi CPU load detected.."?

Yes it can.

Corosync never logs "Hi CPU load detected...".
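By the way, the 8000 ms threshold in the message you quote looks like
80% of the token: 10000 in your config, i.e. the warning fires once a
scheduling pause gets close to the token timeout. To see whether storage
is what is starving corosync, something as simple as this (interval and
count are just an example) is usually enough:

    iostat -x 5 12

Devices sitting near 100% %util for long stretches, together with the
90% IOWAIT you mention, would point at the disks rather than the network.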

> 
> I will surely monitor the system more.

Is that system a VM or a physical machine? The "Corosync main process
was not scheduled for..." message usually shows up on VMs whose hosts
are heavily overloaded.
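
On a VM you can also get a first hint from inside the guest: a
persistently non-zero %steal in top or sar means the hypervisor is not
giving the guest CPU time when it wants it, which is exactly the kind of
pause that produces that message. For example:

    sar -u 5 12    # watch %steal (and %iowait) per 5-second sample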

Honza

> 
> Thanks for your help.
> Prasad
> 
> 
> 
> On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
>>> Prasad Nagaraj <prasad.nagaraj76 at gmail.com> wrote on 21.08.2018 at
>>> 11:42 in message
>>> <CAHbCUJ0zdvpYALCR7tbnGgb8qrZHh8uDjE+RsnkoewvmFb8wAg at mail.gmail.com>:
>>>> Hi Ken - Thanks for your response.
>>>>
>>>> We have seen messages in other cases, like:
>>>> corosync [MAIN  ] Corosync main process was not scheduled for
>>>> 17314.4746 ms
>>>> (threshold is 8000.0000 ms). Consider token timeout increase.
>>>> corosync [TOTEM ] A processor failed, forming new configuration.
>>>>
>>>> Is this an indication of a failure due to CPU load issues, and will
>>>> this get resolved if I upgrade to the Corosync 2.x series?
>>
>> Yes, most definitely this is a CPU issue. It means corosync isn't
>> getting enough CPU cycles to handle the cluster token before the
>> timeout is reached.
>>
>> Upgrading may indeed help, as recent versions ensure that corosync runs
>> with real-time priority in the kernel, and thus are more likely to get
>> CPU time when something of lower priority is consuming all the CPU.
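
You can check what you currently have with chrt from util-linux:

    chrt -p $(pidof corosync)

A build that sets real-time priority should report SCHED_RR with a
non-zero priority; SCHED_OTHER means corosync competes with everything
else on the box for CPU.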
>>
>> But of course, there is some underlying problem that should be
>> identified and addressed. Figure out what's maxing out the CPU or I/O.
>> Ulrich's monitoring suggestion is a good start.
>>
>>> Hi!
>>>
>>> I'd strongly recommend starting monitoring on your nodes, at least
>>> until you know what's going on. The good old UNIX sa (sysstat
>>> package) could be a starting point. I'd monitor CPU idle
>>> specifically. Then go for 100% device utilization, then look for
>>> network bottlenecks...
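
With sysstat installed, something along these lines (intervals are only
an example) covers those three areas:

    sar -u 5 60       # CPU: %idle, %iowait, %steal
    sar -d -p 5 60    # disks: %util and queue sizes
    sar -n DEV 5 60   # network: per-interface throughput

and sar -f against the daily sa data files lets you look at the same
counters afterwards, for the time window of the membership loss.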
>>>
>>> A new corosync release cannot fix those, most likely.
>>>
>>> Regards,
>>> Ulrich
>>>
>>>>
>>>> In any case, in the current scenario, we did not see any
>>>> scheduling-related messages.
>>>>
>>>> Thanks for your help.
>>>> Prasad
>>>>
>>>> On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <kgaillot at redhat.com>
>>>> wrote:
>>>>
>>>>> On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
>>>>>> Hi:
>>>>>>
>>>>>> The other day, I saw a spurious node loss on my 3-node corosync
>>>>>> cluster, with the following logged in the corosync.log of one of
>>>>>> the nodes.
>>>>>>
>>>>>> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
>>>>>> Transitional membership event on ring 32: memb=2, new=0, lost=1
>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
>>>>>> vm02d780875f 67114156
>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
>>>>>> vmfa2757171f 151000236
>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
>>>>>> vm728316982d 201331884
>>>>>> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
>>>>>> Stable
>>>>>> membership event on ring 32: memb=2, new=0, lost=0
>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
>>>>>> vm02d780875f 67114156
>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
>>>>>> vmfa2757171f 151000236
>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info:
>>>>>> ais_mark_unseen_peer_dead:
>>>>>> Node vm728316982d was not seen in the previous transition
>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
>>>>>> 201331884/vm728316982d is now: lost
>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info:
>>>>>> send_member_notification:
>>>>>> Sending membership update 32 to 3 children
>>>>>> Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
>>>>>> the
>>>>>> membership and a new membership was formed.
>>>>>> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:     info:
>>>>>> plugin_handle_membership:     Membership 32: quorum retained
>>>>>> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
>>>>>> crm_update_peer_state_iter:   plugin_handle_membership: Node
>>>>>> vm728316982d[201331884] - state is now lost (was member)
>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>>>>>> plugin_handle_membership:     Membership 32: quorum retained
>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
>>>>>> crm_update_peer_state_iter:   plugin_handle_membership: Node
>>>>>> vm728316982d[201331884] - state is now lost (was member)
>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>>>>>> peer_update_callback: vm728316982d is now lost (was member)
>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:  warning:
>>>>>> match_down_event:     No match for shutdown action on
>>>>>> vm728316982d
>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
>>>>>> peer_update_callback: Stonith/shutdown of vm728316982d not
>>>>>> matched
>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>>>>>> crm_update_peer_join: peer_update_callback: Node
>>>>>> vm728316982d[201331884] - join-6 phase 4 -> 0
>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>>>>>> abort_transition_graph:       Transition aborted: Node failure
>>>>>> (source=peer_update_callback:240, 1)
>>>>>> Aug 18 12:40:25 [4543] vmfa2757171f        cib:     info:
>>>>>> plugin_handle_membership:     Membership 32: quorum retained
>>>>>> Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
>>>>>> crm_update_peer_state_iter:   plugin_handle_membership: Node
>>>>>> vm728316982d[201331884] - state is now lost (was member)
>>>>>> Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
>>>>>> crm_reap_dead_member: Removing vm728316982d/201331884 from the
>>>>>> membership list
>>>>>> Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
>>>>>> reap_crm_member:      Purged 1 peers with id=201331884 and/or
>>>>>> uname=vm728316982d from the membership cache
>>>>>> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
>>>>>> crm_reap_dead_member: Removing vm728316982d/201331884 from the
>>>>>> membership list
>>>>>> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
>>>>>> reap_crm_member:      Purged 1 peers with id=201331884 and/or
>>>>>> uname=vm728316982d from the membership cache
>>>>>>
>>>>>> However, within seconds, the node was able to join back.
>>>>>>
>>>>>> Aug 18 12:40:34 corosync [pcmk  ] notice: pcmk_peer_update:
>>>>>> Stable
>>>>>> membership event on ring 36: memb=3, new=1, lost=0
>>>>>> Aug 18 12:40:34 corosync [pcmk  ] info: update_member: Node
>>>>>> 201331884/vm728316982d is now: member
>>>>>> Aug 18 12:40:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:
>>>>>> vm728316982d 201331884
>>>>>>
>>>>>>
>>>>>> But this was enough time for the cluster to get into a split-brain
>>>>>> kind of situation, with a resource on the node vm728316982d being
>>>>>> stopped because of this node-loss detection.
>>>>>>
>>>>>> Could anyone tell me whether this could happen due to a transient
>>>>>> network disruption or similar?
>>>>>> Are there any configuration settings that can be applied in
>>>>>> corosync.conf so that the cluster is more resilient to such
>>>>>> temporary disruptions?
>>>>>
>>>>> Your corosync sensitivity of a 10-second token timeout and 10
>>>>> retransmissions is already quite long -- the node was likely
>>>>> already unresponsive for more than 10 seconds before the first
>>>>> message above, so it was more than 18 seconds before it rejoined.
>>>>>
>>>>> It's rarely a good idea to change
>>>>> token_retransmits_before_loss_const;
>>>>> changing token is generally enough to deal with transient network
>>>>> unreliability. However, 18 seconds is a really long time to raise
>>>>> the token to, and it's uncertain from the information here whether
>>>>> the root cause was networking or something on the host.
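
If I remember right, when token_retransmit is not set explicitly
corosync derives it from token and token_retransmits_before_loss_const,
so raising token alone already stretches the retransmit interval with
it. Purely as an illustration (the value is arbitrary, pick one that
matches the stalls you actually need to ride out):

totem {
    # (other totem options unchanged)
    token: 15000
    # token_retransmits_before_loss_const deliberately not touched
}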
>>>>>
>>>>> I notice your configuration is corosync 1 with the pacemaker
>>>>> plugin;
>>>>> that is a long-deprecated setup, and corosync 3 is about to come
>>>>> out,
>>>>> so you may want to consider upgrading to at least corosync 2 and
>>>>> a
>>>>> reasonably recent pacemaker. That would give you some reliability
>>>>> improvements, including real-time priority scheduling of
>>>>> corosync,
>>>>> which could have been the issue here if CPU load rather than
>>>>> networking
>>>>> was the root cause.
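
For what it's worth, a corosync 2.x equivalent of the configuration
quoted below would drop the pacemaker plugin block entirely and use a
nodelist plus votequorum. Roughly (untested sketch; nodeids are
arbitrary, addresses reused from your config):

totem {
    version: 2
    secauth: on
    transport: udpu
    token: 10000
}

nodelist {
    node {
        ring0_addr: 172.20.0.4
        nodeid: 1
    }
    node {
        ring0_addr: 172.20.0.9
        nodeid: 2
    }
    node {
        ring0_addr: 172.20.0.12
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

Pacemaker is then started as its own service instead of via the
service { } block.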
>>>>>
>>>>>>
>>>>>> Currently my corosync.conf looks like this:
>>>>>>
>>>>>> compatibility: whitetank
>>>>>>
>>>>>> totem {
>>>>>>     version: 2
>>>>>>     secauth: on
>>>>>>     threads: 0
>>>>>>     interface {
>>>>>>         member {
>>>>>>             memberaddr: 172.20.0.4
>>>>>>         }
>>>>>>         member {
>>>>>>             memberaddr: 172.20.0.9
>>>>>>         }
>>>>>>         member {
>>>>>>             memberaddr: 172.20.0.12
>>>>>>         }
>>>>>>         bindnetaddr: 172.20.0.12
>>>>>>         ringnumber: 0
>>>>>>         mcastport: 5405
>>>>>>         ttl: 1
>>>>>>     }
>>>>>>     transport: udpu
>>>>>>     token: 10000
>>>>>>     token_retransmits_before_loss_const: 10
>>>>>> }
>>>>>>
>>>>>> logging {
>>>>>>     fileline: off
>>>>>>     to_stderr: yes
>>>>>>     to_logfile: yes
>>>>>>     to_syslog: no
>>>>>>     logfile: /var/log/cluster/corosync.log
>>>>>>     timestamp: on
>>>>>>     logger_subsys {
>>>>>>         subsys: AMF
>>>>>>         debug: off
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> service {
>>>>>>     name: pacemaker
>>>>>>     ver: 1
>>>>>> }
>>>>>>
>>>>>> amf {
>>>>>>     mode: disabled
>>>>>> }
>>>>>>
>>>>>> Thanks in advance for the help.
>>>>>> Prasad
>>>>>>
>>>>>
>>>>> --
>>>>> Ken Gaillot <kgaillot at redhat.com>
>>>>>
>>>
>>>
>>>
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>>
> 
> 
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


