[ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

Jan Friesse jfriesse at redhat.com
Thu Aug 23 06:19:18 UTC 2018


Prasad,


> Hi - My systems are single-core CPU VMs running on the Azure platform. I am

Ok, now it makes sense. You don't get many scheduling guarantees in a cloud 
environment, so quite a large scheduling pause can simply happen. 
Also, a single-core CPU is kind of "unsupported" today.

> running MySQL on the nodes, which does generate high I/O load. And my bad, I
> meant to say 'High CPU load detected' is logged by crmd and not corosync.
> Corosync logs messages like 'Corosync main process was not scheduled
> for.....', which in turn sometimes makes the pacemaker monitor action
> fail. Is increasing the token timeout a solution for this, or are there
> other ways to handle it?

Yes. Actually, increasing the token timeout to quite a big value seems to be 
the only "solution". Please keep in mind that it's not free: when a real 
problem happens, corosync does not detect it before the token timeout 
expires, so with a large token timeout, failure detection takes longer.
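
For illustration only, a minimal sketch of what the totem section could look 
like with a larger token timeout. The 30000 ms value below is just an assumed 
example, not a recommendation; pick something that covers the scheduling 
pauses you actually see logged:

totem {
        version: 2
        # Assumed example value: tolerate up to 30 s without the token
        # before corosync declares a membership change.
        token: 30000
        # Left at the value from your current config; changing the
        # retransmit count is rarely useful.
        token_retransmits_before_loss_const: 10
}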

Regards,
   Honza

> 
> Thanks for the help
> Prasad
> 
> On Wed, 22 Aug 2018, 11:55 am Jan Friesse, <jfriesse at redhat.com> wrote:
> 
>> Prasad,
>>
>>> Thanks Ken and Ulrich. There is definitely high I/O on the system, with
>>> IOWAITs sometimes of up to 90%.
>>> I have come across some previous posts saying that IOWAIT is also
>>> considered as CPU load by Corosync. Is this true? Can high I/O lead
>>> corosync to complain as in "Corosync main process was not scheduled
>>> for..." or "Hi CPU load detected.." ?
>>
>> Yes it can.
>>
>> Corosync never logs "Hi CPU load detected...".
>>
>>>
>>> I will surely monitor the system more.
>>
>> Is that system a VM or a physical machine? Because "Corosync main process
>> was not scheduled for..." usually happens on VMs where the hosts are
>> highly overloaded.
>>
>> Honza
>>
>>>
>>> Thanks for your help.
>>> Prasad
>>>
>>>
>>>
>>> On Tue, Aug 21, 2018 at 9:07 PM, Ken Gaillot <kgaillot at redhat.com>
>> wrote:
>>>
>>>> On Tue, 2018-08-21 at 15:29 +0200, Ulrich Windl wrote:
>>>>>>>> Prasad Nagaraj <prasad.nagaraj76 at gmail.com> wrote on
>>>>>>>> 21.08.2018 at 11:42 in message
>>>>> <CAHbCUJ0zdvpYALCR7tbnGgb8qrZHh8uDjE+RsnkoewvmFb8wAg at mail.gmail.com>:
>>>>>> Hi Ken - Thanks for your response.
>>>>>>
>>>>>> We have indeed seen messages in other cases, like
>>>>>> corosync [MAIN  ] Corosync main process was not scheduled for
>>>>>> 17314.4746 ms
>>>>>> (threshold is 8000.0000 ms). Consider token timeout increase.
>>>>>> corosync [TOTEM ] A processor failed, forming new configuration.
>>>>>>
>>>>>> Is this an indication of a failure due to CPU load issues, and will
>>>>>> this get resolved if I upgrade to the Corosync 2.x series?
>>>>
>>>> Yes, most definitely this is a CPU issue. It means corosync isn't
>>>> getting enough CPU cycles to handle the cluster token before the
>>>> timeout is reached.
>>>>
>>>> Upgrading may indeed help, as recent versions ensure that corosync runs
>>>> with real-time priority in the kernel, and thus are more likely to get
>>>> CPU time when something of lower priority is consuming all the CPU.
>>>>
>>>> But of course, there is some underlying problem that should be
>>>> identified and addressed. Figure out what's maxing out the CPU or I/O.
>>>> Ulrich's monitoring suggestion is a good start.
>>>>
>>>>> Hi!
>>>>>
>>>>> I'd strongly recommend starting monitoring on your nodes, at least
>>>>> until you know what's going on. The good old UNIX sa (sysstat
>>>>> package) could be a starting point. I'd monitor CPU idle
>>>>> specifically. Then go for 100% device utilization, then look for
>>>>> network bottlenecks...
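>>>>>
>>>>> For example (just a generic illustration, assuming the sysstat tools
>>>>> are installed; adjust the interval as needed):
>>>>>
>>>>>   sar -u 5       # CPU utilization, including %iowait and %idle
>>>>>   sar -d 5       # block device utilization (%util per device)
>>>>>   sar -n DEV 5   # per-interface network traffic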
>>>>>
>>>>> A new corosync release cannot fix those, most likely.
>>>>>
>>>>> Regards,
>>>>> Ulrich
>>>>>
>>>>>>
>>>>>> In any case, for the current scenario, we did not see any
>>>>>> scheduling
>>>>>> related messages.
>>>>>>
>>>>>> Thanks for your help.
>>>>>> Prasad
>>>>>>
>>>>>> On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <kgaillot at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
>>>>>>>> Hi:
>>>>>>>>
>>>>>>>> The other day, I saw a spurious node loss on my 3-node corosync
>>>>>>>> cluster, with the following logged in the corosync.log of one of
>>>>>>>> the nodes.
>>>>>>>>
>>>>>>>> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
>>>>>>>> Transitional membership event on ring 32: memb=2, new=0, lost=1
>>>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
>>>>>>>> vm02d780875f 67114156
>>>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
>>>>>>>> vmfa2757171f 151000236
>>>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
>>>>>>>> vm728316982d 201331884
>>>>>>>> Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
>>>>>>>> Stable
>>>>>>>> membership event on ring 32: memb=2, new=0, lost=0
>>>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
>>>>>>>> vm02d780875f 67114156
>>>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
>>>>>>>> vmfa2757171f 151000236
>>>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info:
>>>>>>>> ais_mark_unseen_peer_dead:
>>>>>>>> Node vm728316982d was not seen in the previous transition
>>>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
>>>>>>>> 201331884/vm728316982d is now: lost
>>>>>>>> Aug 18 12:40:25 corosync [pcmk  ] info:
>>>>>>>> send_member_notification:
>>>>>>>> Sending membership update 32 to 3 children
>>>>>>>> Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left
>>>>>>>> the
>>>>>>>> membership and a new membership was formed.
>>>>>>>> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:     info:
>>>>>>>> plugin_handle_membership:     Membership 32: quorum retained
>>>>>>>> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
>>>>>>>> crm_update_peer_state_iter:   plugin_handle_membership: Node
>>>>>>>> vm728316982d[201331884] - state is now lost (was member)
>>>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>>>>>>>> plugin_handle_membership:     Membership 32: quorum retained
>>>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
>>>>>>>> crm_update_peer_state_iter:   plugin_handle_membership: Node
>>>>>>>> vm728316982d[201331884] - state is now lost (was member)
>>>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>>>>>>>> peer_update_callback: vm728316982d is now lost (was member)
>>>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:  warning:
>>>>>>>> match_down_event:     No match for shutdown action on
>>>>>>>> vm728316982d
>>>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
>>>>>>>> peer_update_callback: Stonith/shutdown of vm728316982d not
>>>>>>>> matched
>>>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>>>>>>>> crm_update_peer_join: peer_update_callback: Node
>>>>>>>> vm728316982d[201331884] - join-6 phase 4 -> 0
>>>>>>>> Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>>>>>>>> abort_transition_graph:       Transition aborted: Node failure
>>>>>>>> (source=peer_update_callback:240, 1)
>>>>>>>> Aug 18 12:40:25 [4543] vmfa2757171f        cib:     info:
>>>>>>>> plugin_handle_membership:     Membership 32: quorum retained
>>>>>>>> Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
>>>>>>>> crm_update_peer_state_iter:   plugin_handle_membership: Node
>>>>>>>> vm728316982d[201331884] - state is now lost (was member)
>>>>>>>> Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
>>>>>>>> crm_reap_dead_member: Removing vm728316982d/201331884 from the
>>>>>>>> membership list
>>>>>>>> Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
>>>>>>>> reap_crm_member:      Purged 1 peers with id=201331884 and/or
>>>>>>>> uname=vm728316982d from the membership cache
>>>>>>>> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
>>>>>>>> crm_reap_dead_member: Removing vm728316982d/201331884 from the
>>>>>>>> membership list
>>>>>>>> Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
>>>>>>>> reap_crm_member:      Purged 1 peers with id=201331884 and/or
>>>>>>>> uname=vm728316982d from the membership cache
>>>>>>>>
>>>>>>>> However, within seconds, the node was able to join back.
>>>>>>>>
>>>>>>>> Aug 18 12:40:34 corosync [pcmk  ] notice: pcmk_peer_update:
>>>>>>>> Stable
>>>>>>>> membership event on ring 36: memb=3, new=1, lost=0
>>>>>>>> Aug 18 12:40:34 corosync [pcmk  ] info: update_member: Node
>>>>>>>> 201331884/vm728316982d is now: member
>>>>>>>> Aug 18 12:40:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:
>>>>>>>> vm728316982d 201331884
>>>>>>>>
>>>>>>>>
>>>>>>>> But this was enough time for the cluster to get into a split-brain
>>>>>>>> kind of situation, with a resource on the node vm728316982d being
>>>>>>>> stopped because of this node loss detection.
>>>>>>>>
>>>>>>>> Could anyone help me understand whether this could happen due to a
>>>>>>>> transient network disruption or something similar?
>>>>>>>> Are there any configuration settings that can be applied in
>>>>>>>> corosync.conf so that the cluster is more resilient to such
>>>>>>>> temporary disruptions?
>>>>>>>
>>>>>>> Your corosync sensitivity of a 10-second token timeout and 10
>>>>>>> retransmissions is already very lengthy -- likely the node was
>>>>>>> already unresponsive for more than 10 seconds before the first
>>>>>>> message above, so it was more than 18 seconds before it rejoined.
>>>>>>>
>>>>>>> It's rarely a good idea to change
>>>>>>> token_retransmits_before_loss_const;
>>>>>>> changing token is generally enough to deal with transient network
>>>>>>> unreliability. However 18 seconds is a really long time to raise
>>>>>>> the
>>>>>>> token to, and it's uncertain from the information here whether
>>>>>>> the root
>>>>>>> cause was networking or something on the host.
>>>>>>>
>>>>>>> I notice your configuration is corosync 1 with the pacemaker
>>>>>>> plugin;
>>>>>>> that is a long-deprecated setup, and corosync 3 is about to come
>>>>>>> out,
>>>>>>> so you may want to consider upgrading to at least corosync 2 and
>>>>>>> a
>>>>>>> reasonably recent pacemaker. That would give you some reliability
>>>>>>> improvements, including real-time priority scheduling of
>>>>>>> corosync,
>>>>>>> which could have been the issue here if CPU load rather than
>>>>>>> networking
>>>>>>> was the root cause.
>>>>>>>
>>>>>>>>
>>>>>>>> Currently my corosync.conf looks like this :
>>>>>>>>
>>>>>>>> compatibility: whitetank
>>>>>>>> totem {
>>>>>>>>       version: 2
>>>>>>>>       secauth: on
>>>>>>>>       threads: 0
>>>>>>>>       interface {
>>>>>>>>       member {
>>>>>>>>               memberaddr: 172.20.0.4
>>>>>>>>       }
>>>>>>>>       member {
>>>>>>>>               memberaddr: 172.20.0.9
>>>>>>>>       }
>>>>>>>>       member {
>>>>>>>>               memberaddr: 172.20.0.12
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       bindnetaddr: 172.20.0.12
>>>>>>>>
>>>>>>>>       ringnumber: 0
>>>>>>>>       mcastport: 5405
>>>>>>>>       ttl: 1
>>>>>>>>       }
>>>>>>>>       transport: udpu
>>>>>>>>       token: 10000
>>>>>>>>       token_retransmits_before_loss_const: 10
>>>>>>>> }
>>>>>>>>
>>>>>>>> logging {
>>>>>>>>       fileline: off
>>>>>>>>       to_stderr: yes
>>>>>>>>       to_logfile: yes
>>>>>>>>       to_syslog: no
>>>>>>>>       logfile: /var/log/cluster/corosync.log
>>>>>>>>       timestamp: on
>>>>>>>>       logger_subsys {
>>>>>>>>               subsys: AMF
>>>>>>>>               debug: off
>>>>>>>>       }
>>>>>>>> }
>>>>>>>> service {
>>>>>>>>       name: pacemaker
>>>>>>>>       ver: 1
>>>>>>>> }
>>>>>>>> amf {
>>>>>>>>       mode: disabled
>>>>>>>> }
>>>>>>>>
>>>>>>>> Thanks in advance for the help.
>>>>>>>> Prasad
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ken Gaillot <kgaillot at redhat.com>
>>>>>>>
>>>>>
>>>>>
>>>>>
>>>> --
>>>> Ken Gaillot <kgaillot at redhat.com>
>>>>
>>>
>>>
>>>
>>>
>>
>>
> 


