[ClusterLabs] 2 nodes split brain with token timeout

Ken Gaillot kgaillot at redhat.com
Tue Jul 23 10:13:45 EDT 2019


On Mon, 2019-07-22 at 15:54 +0000, Jean-Jacques Pons wrote:
> Hello everyone,
>  
> I’m having a stability issue with a two-node active/passive HA
> infrastructure (Zabbix VMs in this case).
> The daily backup creates latency that delays Corosync scheduling and
> triggers a token timeout. This frequently ends in a split-brain
> situation, where the service is started on both nodes at the same time.
>  
> I did increase the token timeout to 4000 by updating corosync.conf, 

Rather than slow down the cluster's response time at all times, a
better approach might be to use a pacemaker rule to put the cluster in
maintenance mode around the time of the backup.

How you do that varies by the tool you're using, but for the low level
see:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_using_rules_to_control_cluster_options

(in this case, maintenance-mode is in the
crm_config/cluster_property_set section instead of
rsc_defaults/meta_attributes)

> on both nodes, followed by the command “sudo corosync-cfgtool -R”.
> But this isn’t reflected in the log messages …
> 1st question: Why?
> 2nd question: I found references to increasing
> token_retransmits_before_loss_const. Should I? To which value?
>  
> Best regards.
>  
> JJ
>  
>  
> NODE 2
> Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]:  [MAIN  ] Corosync main
> process was not scheduled for 9902.1504 ms (threshold is 800.0000
> ms). Consider token timeout increase.
> Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]:  [TOTEM ] A processor
> failed, forming new configuration.
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]:  [TOTEM ] A new
> membership (10.XX.YY.1:5808) was formed. Members joined: 1 left: 1
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]:  [TOTEM ] Failed to
> receive the leave message. failed: 1
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]:  [QUORUM] Members[2]: 1
> 2
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]:  [MAIN  ] Completed
> service synchronization, ready to provide service.
>  
>  
> NODE1
> Jul 22 13:30:55 FRPLZABPXY01 corosync[1110]:  [TOTEM ] A processor
> failed, forming new configuration.
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]:  [TOTEM ] A new
> membership (10.XX.YY.1:5804) was formed. Members left: 2
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]:  [TOTEM ] Failed to
> receive the leave message. failed: 2
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]:  [QUORUM] Members[1]: 1
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]:  [MAIN  ] Completed
> service synchronization, ready to provide service.
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]:  [TOTEM ] A new
> membership (10.XX.YY.1:5808) was formed. Members joined: 2
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]:  [QUORUM] Members[2]: 1
> 2
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]:  [MAIN  ] Completed
> service synchronization, ready to provide service.
>  
>  
> cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     secauth: off
>     cluster_name: FRPLZABPXY
>     transport: udpu
>     totem: 4000
>     interface {
>         ringnumber: 0
>         bindnetaddr: 10.XX.YY.2
>         broadcast: yes
>         mcastport: 5405
>     }
> }
> nodelist {
>     node {
>         ring0_addr: 10.XX.YY.1
>         name: FRPLZABPXY01
>         nodeid: 1
>     }
>  
>     node {
>         ring0_addr: 10.XX.YY.2
>         name: FRPLZABPXY02
>         nodeid: 2
>     }
> }
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
> logging {
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
> }
>  
>  
> sudo corosync-cmapctl | grep -E "(.config.totem.|^totem.)"
> runtime.config.totem.consensus (u32) = 1200
> runtime.config.totem.downcheck (u32) = 1000
> runtime.config.totem.fail_recv_const (u32) = 2500
> runtime.config.totem.heartbeat_failures_allowed (u32) = 0
> runtime.config.totem.hold (u32) = 180
> runtime.config.totem.join (u32) = 50
> runtime.config.totem.max_messages (u32) = 17
> runtime.config.totem.max_network_delay (u32) = 50
> runtime.config.totem.merge (u32) = 200
> runtime.config.totem.miss_count_const (u32) = 5
> runtime.config.totem.rrp_autorecovery_check_timeout (u32) = 1000
> runtime.config.totem.rrp_problem_count_mcast_threshold (u32) = 100
> runtime.config.totem.rrp_problem_count_threshold (u32) = 10
> runtime.config.totem.rrp_problem_count_timeout (u32) = 2000
> runtime.config.totem.rrp_token_expired_timeout (u32) = 238
> runtime.config.totem.send_join (u32) = 0
> runtime.config.totem.seqno_unchanged_const (u32) = 30
> runtime.config.totem.token (u32) = 1000
> runtime.config.totem.token_retransmit (u32) = 238
> runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
> runtime.config.totem.window_size (u32) = 50
> totem.cluster_name (str) = FRPLZABPXY
> totem.interface.0.bindnetaddr (str) = 10.XX.YY.2
> totem.interface.0.broadcast (str) = yes
> totem.interface.0.mcastport (u16) = 5405
> totem.secauth (str) = off
> totem.totem (str) = 4000
> totem.transport (str) = udpu
> totem.version (u32) = 2
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
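Regarding your first question: note that the corosync.conf above reads
"totem: 4000", and the cmapctl output shows that as a stray key
(totem.totem (str) = 4000) while the actual timeout is unchanged
(runtime.config.totem.token (u32) = 1000). The totem option that sets
the token timeout is named "token", so the 4000 ms value most likely
never took effect. A corrected fragment (keeping your other settings
as-is) would be:

```
totem {
    version: 2
    secauth: off
    cluster_name: FRPLZABPXY
    transport: udpu
    token: 4000
}
```

As for the second question: token_retransmit is derived automatically
from token and token_retransmits_before_loss_const (which appears
consistent with the 238 ms shown in your runtime output for a 1000 ms
token), so once the token timeout is actually applied, the retransmit
timing scales with it and token_retransmits_before_loss_const can
usually stay at its default.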
-- 
Ken Gaillot <kgaillot at redhat.com>
