[ClusterLabs] 2 nodes split brain with token timeout
Jean-Jacques Pons
jj.pons at arkadin.com
Wed Jul 24 02:52:39 EDT 2019
Hello Jan,
Thanks for your input.
Turns out there was a typo in the configuration file (totem instead of token) ...
It should be fine now.
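For anyone else running into this: the option name in the totem section must be "token", not "totem". A rough sketch of the corrected block, using the values from the config quoted further down:

totem {
    version: 2
    secauth: off
    cluster_name: FRPLZABPXY
    transport: udpu
    token: 4000    # correct option name; the mistyped "totem: 4000" was ignored and the 1000 ms default stayed in effect
    interface {
        ringnumber: 0
        bindnetaddr: 10.XX.YY.2
        broadcast: yes
        mcastport: 5405
    }
}
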
Regards.
-----Original Message-----
From: Jan Friesse <jfriesse at redhat.com>
Sent: Tuesday, July 23, 2019 16:08
To: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>; Jean-Jacques Pons <jj.pons at arkadin.com>
Subject: Re: [ClusterLabs] 2 nodes split brain with token timeout
Jean-Jacques,
> Hello everyone,
>
> I'm having stability issues with a 2-node active/passive HA infrastructure (Zabbix VMs in this case).
> The daily backup creates latency, slowing Corosync scheduling and triggering a token timeout. This frequently ends in a split-brain situation, where the service is started on both nodes at the same time.
>
> I increased the token timeout to 4000 by updating corosync.conf on both nodes, followed by the command "sudo corosync-cfgtool -R".
> But this change is not reflected in the log message ...
Which message do you mean? The "not scheduled" one?
> 1st question: Why?
I'm almost sure it is reflected.
> 2nd question: I find references to increasing token_retransmits_before_loss_const. Should I? To what value?
Nope. Retransmits only help when token packets are lost on the network; they do not help when the corosync process itself is not being scheduled.
>
> Best regards.
>
> JJ
>
>
> NODE 2
> Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]: [MAIN ] Corosync main process was not scheduled for 9902.1504 ms (threshold is 800.0000 ms). Consider token timeout increase.
The machine was not scheduled for about 9 seconds, so a 4-second token timeout is not enough.
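Something along these lines should do it (the value below is only illustrative; pick a timeout comfortably above the pauses you actually observe):

# In the totem { } section of /etc/corosync/corosync.conf, on both nodes:
#     token: 12000    # illustrative value, larger than the ~9.9 s pause reported above

sudo corosync-cfgtool -R                                   # tell all corosync instances to reload corosync.conf
sudo corosync-cmapctl | grep runtime.config.totem.token    # confirm the runtime token value actually changed
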
Regards,
Honza
> Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]: [TOTEM ] A processor failed, forming new configuration.
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [TOTEM ] A new membership (10.XX.YY.1:5808) was formed. Members joined: 1 left: 1
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [TOTEM ] Failed to receive the leave message. failed: 1
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [QUORUM] Members[2]: 1 2
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [MAIN ] Completed service synchronization, ready to provide service.
>
>
> NODE1
> Jul 22 13:30:55 FRPLZABPXY01 corosync[1110]: [TOTEM ] A processor failed, forming new configuration.
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [TOTEM ] A new membership (10.XX.YY.1:5804) was formed. Members left: 2
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [TOTEM ] Failed to receive the leave message. failed: 2
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [QUORUM] Members[1]: 1
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [MAIN ] Completed service synchronization, ready to provide service.
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]: [TOTEM ] A new membership (10.XX.YY.1:5808) was formed. Members joined: 2
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]: [QUORUM] Members[2]: 1 2
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]: [MAIN ] Completed service synchronization, ready to provide service.
>
>
> cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     secauth: off
>     cluster_name: FRPLZABPXY
>     transport: udpu
>     totem: 4000
>     interface {
>         ringnumber: 0
>         bindnetaddr: 10.XX.YY.2
>         broadcast: yes
>         mcastport: 5405
>     }
> }
> nodelist {
>     node {
>         ring0_addr: 10.XX.YY.1
>         name: FRPLZABPXY01
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: 10.XX.YY.2
>         name: FRPLZABPXY02
>         nodeid: 2
>     }
> }
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
> logging {
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
> }
>
>
> sudo corosync-cmapctl | grep -E "(.config.totem.|^totem.)"
> runtime.config.totem.consensus (u32) = 1200
> runtime.config.totem.downcheck (u32) = 1000
> runtime.config.totem.fail_recv_const (u32) = 2500
> runtime.config.totem.heartbeat_failures_allowed (u32) = 0
> runtime.config.totem.hold (u32) = 180
> runtime.config.totem.join (u32) = 50
> runtime.config.totem.max_messages (u32) = 17
> runtime.config.totem.max_network_delay (u32) = 50
> runtime.config.totem.merge (u32) = 200
> runtime.config.totem.miss_count_const (u32) = 5
> runtime.config.totem.rrp_autorecovery_check_timeout (u32) = 1000
> runtime.config.totem.rrp_problem_count_mcast_threshold (u32) = 100
> runtime.config.totem.rrp_problem_count_threshold (u32) = 10
> runtime.config.totem.rrp_problem_count_timeout (u32) = 2000
> runtime.config.totem.rrp_token_expired_timeout (u32) = 238
> runtime.config.totem.send_join (u32) = 0
> runtime.config.totem.seqno_unchanged_const (u32) = 30
> runtime.config.totem.token (u32) = 1000
> runtime.config.totem.token_retransmit (u32) = 238
> runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
> runtime.config.totem.window_size (u32) = 50
> totem.cluster_name (str) = FRPLZABPXY
> totem.interface.0.bindnetaddr (str) = 10.XX.YY.2
> totem.interface.0.broadcast (str) = yes
> totem.interface.0.mcastport (u16) = 5405
> totem.secauth (str) = off
> totem.totem (str) = 4000
> totem.transport (str) = udpu
> totem.version (u32) = 2
>
>
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home:
> https://www.clusterlabs.org/
>