[ClusterLabs] Antw: Re: Spurious node loss in corosync cluster

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Tue Aug 21 09:29:14 EDT 2018


>>> Prasad Nagaraj <prasad.nagaraj76 at gmail.com> wrote on 21.08.2018 at 11:42 in
message
<CAHbCUJ0zdvpYALCR7tbnGgb8qrZHh8uDjE+RsnkoewvmFb8wAg at mail.gmail.com>:
> Hi Ken - Thanks for your response.
> 
> We have seen messages like the following in other cases:
> corosync [MAIN  ] Corosync main process was not scheduled for 17314.4746 ms
> (threshold is 8000.0000 ms). Consider token timeout increase.
> corosync [TOTEM ] A processor failed, forming new configuration.
> 
> Is this an indication of a failure due to CPU load issues, and will it
> get resolved if I upgrade to the Corosync 2.x series?

Hi!

I'd strongly recommend setting up monitoring on your nodes, at least until you know what's going on. The good old UNIX sar (sysstat package) could be a starting point. I'd monitor CPU idle specifically, then check for devices running at 100% utilization, then look for network bottlenecks...
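
For example, a minimal sketch of what I have in mind (assuming the sysstat
tools are installed; the 10-second interval is arbitrary):

    # CPU utilization - watch the %idle column
    sar -u 10
    # per-device I/O - watch for devices pinned near 100% in %util
    sar -d 10
    # per-interface network statistics
    sar -n DEV 10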

A new corosync release most likely cannot fix those issues.

Regards,
Ulrich

> 
> In any case, for the current scenario, we did not see any
> scheduling-related messages.
> 
> Thanks for your help.
> Prasad
> 
> On Mon, Aug 20, 2018 at 7:57 PM, Ken Gaillot <kgaillot at redhat.com> wrote:
> 
>> On Sun, 2018-08-19 at 17:35 +0530, Prasad Nagaraj wrote:
>> > Hi:
>> >
>> > One of these days, I saw a spurious node loss on my 3-node corosync
>> > cluster with following logged in the corosync.log of one of the
>> > nodes.
>> >
>> > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update:
>> > Transitional membership event on ring 32: memb=2, new=0, lost=1
>> > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
>> > vm02d780875f 67114156
>> > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: memb:
>> > vmfa2757171f 151000236
>> > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: lost:
>> > vm728316982d 201331884
>> > Aug 18 12:40:25 corosync [pcmk  ] notice: pcmk_peer_update: Stable
>> > membership event on ring 32: memb=2, new=0, lost=0
>> > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
>> > vm02d780875f 67114156
>> > Aug 18 12:40:25 corosync [pcmk  ] info: pcmk_peer_update: MEMB:
>> > vmfa2757171f 151000236
>> > Aug 18 12:40:25 corosync [pcmk  ] info: ais_mark_unseen_peer_dead:
>> > Node vm728316982d was not seen in the previous transition
>> > Aug 18 12:40:25 corosync [pcmk  ] info: update_member: Node
>> > 201331884/vm728316982d is now: lost
>> > Aug 18 12:40:25 corosync [pcmk  ] info: send_member_notification:
>> > Sending membership update 32 to 3 children
>> > Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left the
>> > membership and a new membership was formed.
>> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:     info:
>> > plugin_handle_membership:     Membership 32: quorum retained
>> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
>> > crm_update_peer_state_iter:   plugin_handle_membership: Node
>> > vm728316982d[201331884] - state is now lost (was member)
>> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>> > plugin_handle_membership:     Membership 32: quorum retained
>> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
>> > crm_update_peer_state_iter:   plugin_handle_membership: Node
>> > vm728316982d[201331884] - state is now lost (was member)
>> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>> > peer_update_callback: vm728316982d is now lost (was member)
>> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:  warning:
>> > match_down_event:     No match for shutdown action on vm728316982d
>> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:   notice:
>> > peer_update_callback: Stonith/shutdown of vm728316982d not matched
>> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>> > crm_update_peer_join: peer_update_callback: Node
>> > vm728316982d[201331884] - join-6 phase 4 -> 0
>> > Aug 18 12:40:25 [4548] vmfa2757171f       crmd:     info:
>> > abort_transition_graph:       Transition aborted: Node failure
>> > (source=peer_update_callback:240, 1)
>> > Aug 18 12:40:25 [4543] vmfa2757171f        cib:     info:
>> > plugin_handle_membership:     Membership 32: quorum retained
>> > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
>> > crm_update_peer_state_iter:   plugin_handle_membership: Node
>> > vm728316982d[201331884] - state is now lost (was member)
>> > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
>> > crm_reap_dead_member: Removing vm728316982d/201331884 from the
>> > membership list
>> > Aug 18 12:40:25 [4543] vmfa2757171f        cib:   notice:
>> > reap_crm_member:      Purged 1 peers with id=201331884 and/or
>> > uname=vm728316982d from the membership cache
>> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
>> > crm_reap_dead_member: Removing vm728316982d/201331884 from the
>> > membership list
>> > Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng:   notice:
>> > reap_crm_member:      Purged 1 peers with id=201331884 and/or
>> > uname=vm728316982d from the membership cache
>> >
>> > However, within seconds, the node was able to join back.
>> >
>> > Aug 18 12:40:34 corosync [pcmk  ] notice: pcmk_peer_update: Stable
>> > membership event on ring 36: memb=3, new=1, lost=0
>> > Aug 18 12:40:34 corosync [pcmk  ] info: update_member: Node
>> > 201331884/vm728316982d is now: member
>> > Aug 18 12:40:34 corosync [pcmk  ] info: pcmk_peer_update: NEW:
>> > vm728316982d 201331884
>> >
>> >
>> > But this was enough time for the cluster to get into a split-brain-like
>> > situation, with a resource on node vm728316982d being stopped
>> > because of this node-loss detection.
>> >
>> > Could anyone help clarify whether this could happen due to a transient
>> > network disruption?
>> > Are there any configuration settings that can be applied in
>> > corosync.conf so that the cluster is more resilient to such temporary
>> > disruptions?
>>
>> Your corosync settings of a 10-second token timeout and 10
>> retransmissions already make for a very long failure-detection window --
>> likely the node was already unresponsive for more than 10 seconds before
>> the first message above, so it was more than 18 seconds before it
>> rejoined.
>>
>> It's rarely a good idea to change token_retransmits_before_loss_const;
>> changing token is generally enough to deal with transient network
>> unreliability. However, 18 seconds is a really long time to raise the
>> token to, and it's unclear from the information here whether the root
>> cause was networking or something on the host.
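>>
>> Just to illustrate where that lives (the value below is a placeholder,
>> not a recommendation), a token-only change would go in the totem
>> section of corosync.conf:
>>
>>     totem {
>>         # raise only the token timeout; leave
>>         # token_retransmits_before_loss_const at its default
>>         token: 15000
>>     }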
>>
>> I notice your configuration is corosync 1 with the pacemaker plugin;
>> that is a long-deprecated setup, and corosync 3 is about to come out,
>> so you may want to consider upgrading to at least corosync 2 and a
>> reasonably recent pacemaker. That would give you some reliability
>> improvements, including real-time priority scheduling of corosync,
>> which could have been the issue here if CPU load rather than networking
>> was the root cause.
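>>
>> As a quick sanity check after such an upgrade (hypothetical example;
>> chrt comes from util-linux), something like
>>
>>     chrt -p $(pidof corosync)
>>
>> should report a real-time policy such as SCHED_RR rather than
>> SCHED_OTHER once corosync is running with real-time priority.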
>>
>> >
>> > Currently my corosync.conf looks like this :
>> >
>> > compatibility: whitetank
>> > totem {
>> >     version: 2
>> >     secauth: on
>> >     threads: 0
>> >     interface {
>> >         member {
>> >             memberaddr: 172.20.0.4
>> >         }
>> >         member {
>> >             memberaddr: 172.20.0.9
>> >         }
>> >         member {
>> >             memberaddr: 172.20.0.12
>> >         }
>> >
>> >         bindnetaddr: 172.20.0.12
>> >
>> >         ringnumber: 0
>> >         mcastport: 5405
>> >         ttl: 1
>> >     }
>> >     transport: udpu
>> >     token: 10000
>> >     token_retransmits_before_loss_const: 10
>> > }
>> >
>> > logging {
>> >     fileline: off
>> >     to_stderr: yes
>> >     to_logfile: yes
>> >     to_syslog: no
>> >     logfile: /var/log/cluster/corosync.log
>> >     timestamp: on
>> >     logger_subsys {
>> >         subsys: AMF
>> >         debug: off
>> >     }
>> > }
>> > service {
>> >     name: pacemaker
>> >     ver: 1
>> > }
>> > amf {
>> >     mode: disabled
>> > }
>> >
>> > Thanks in advance for the help.
>> > Prasad
>> >
>> --
>> Ken Gaillot <kgaillot at redhat.com>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org 
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
>>