[ClusterLabs] 2 nodes split brain with token timeout
Jan Friesse
jfriesse at redhat.com
Tue Jul 23 10:08:19 EDT 2019
Jean-Jacques,
> Hello everyone,
>
> I'm having a stability issue with a 2-node active/passive HA infrastructure (Zabbix VMs in this case).
> The daily backup creates latency, slowing Corosync scheduling and triggering a token timeout. It frequently ends in a split brain, where the service is started on both nodes at the same time.
>
> I increased the token timeout to 4000 by updating corosync.conf on both nodes, then ran the command "sudo corosync-cfgtool -R".
> But this isn't reflected in the log message ...
Which message do you mean? The "not scheduled" one?
> 1st question: Why?
I'm almost sure it is reflected.
> 2nd question: I found references to increasing token_retransmits_before_loss_const. Should I? To which value?
Nope.
>
> Best regards.
>
> JJ
>
>
> NODE 2
> Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]: [MAIN ] Corosync main process was not scheduled for 9902.1504 ms (threshold is 800.0000 ms). Consider token timeout increase.
The machine was not scheduled for more than 9 seconds (9902 ms), so a
4-second token timeout is not enough.
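
The comparison above can be checked directly against the log line. The following is only a rough sketch: the helper names parse_pause and token_covers_pause are made up for illustration, and "token larger than the pause" is a simplification of how Corosync decides that a node has failed. The 1000 and 4000 values are the runtime and configured values quoted in this thread; 12000 is an arbitrary example, not a recommendation.

```python
import re

# Log line copied from the NODE 2 excerpt in this thread.
LOG_LINE = (
    "Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]: [MAIN ] Corosync main "
    "process was not scheduled for 9902.1504 ms (threshold is 800.0000 ms). "
    "Consider token timeout increase."
)

# Matches the pause length and the reported threshold, both in milliseconds.
PAUSE_RE = re.compile(
    r"not scheduled for (?P<pause>[\d.]+) ms "
    r"\(threshold is (?P<threshold>[\d.]+) ms\)"
)


def parse_pause(line):
    """Return (pause_ms, threshold_ms) from a 'not scheduled' line, or None."""
    match = PAUSE_RE.search(line)
    if match is None:
        return None
    return float(match.group("pause")), float(match.group("threshold"))


def token_covers_pause(token_ms, pause_ms):
    """Simplified check (an assumption): the token timeout must exceed the pause."""
    return token_ms > pause_ms


pause_ms, threshold_ms = parse_pause(LOG_LINE)
for token_ms in (1000, 4000, 12000):
    verdict = "covers" if token_covers_pause(token_ms, pause_ms) else "does not cover"
    print(f"token {token_ms} ms {verdict} a pause of {pause_ms} ms")
```

Under this simplified check, neither 1000 ms nor 4000 ms covers the pause seen in the log.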
Regards,
Honza
> Jul 22 13:30:52 FRPLZABPXY02 corosync[11552]: [TOTEM ] A processor failed, forming new configuration.
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [TOTEM ] A new membership (10.XX.YY.1:5808) was formed. Members joined: 1 left: 1
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [TOTEM ] Failed to receive the leave message. failed: 1
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [QUORUM] Members[2]: 1 2
> Jul 22 13:31:03 FRPLZABPXY02 corosync[11552]: [MAIN ] Completed service synchronization, ready to provide service.
>
>
> NODE1
> Jul 22 13:30:55 FRPLZABPXY01 corosync[1110]: [TOTEM ] A processor failed, forming new configuration.
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [TOTEM ] A new membership (10.XX.YY.1:5804) was formed. Members left: 2
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [TOTEM ] Failed to receive the leave message. failed: 2
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [QUORUM] Members[1]: 1
> Jul 22 13:30:56 FRPLZABPXY01 corosync[1110]: [MAIN ] Completed service synchronization, ready to provide service.
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]: [TOTEM ] A new membership (10.XX.YY.1:5808) was formed. Members joined: 2
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]: [QUORUM] Members[2]: 1 2
> Jul 22 13:31:03 FRPLZABPXY01 corosync[1110]: [MAIN ] Completed service synchronization, ready to provide service.
>
>
> cat /etc/corosync/corosync.conf
> totem {
>     version: 2
>     secauth: off
>     cluster_name: FRPLZABPXY
>     transport: udpu
>     totem: 4000
>     interface {
>         ringnumber: 0
>         bindnetaddr: 10.XX.YY.2
>         broadcast: yes
>         mcastport: 5405
>     }
> }
> nodelist {
>     node {
>         ring0_addr: 10.XX.YY.1
>         name: FRPLZABPXY01
>         nodeid: 1
>     }
>
>     node {
>         ring0_addr: 10.XX.YY.2
>         name: FRPLZABPXY02
>         nodeid: 2
>     }
> }
> quorum {
>     provider: corosync_votequorum
>     two_node: 1
> }
> logging {
>     to_logfile: yes
>     logfile: /var/log/cluster/corosync.log
>     to_syslog: yes
> }
>
>
> sudo corosync-cmapctl | grep -E "(.config.totem.|^totem.)"
> runtime.config.totem.consensus (u32) = 1200
> runtime.config.totem.downcheck (u32) = 1000
> runtime.config.totem.fail_recv_const (u32) = 2500
> runtime.config.totem.heartbeat_failures_allowed (u32) = 0
> runtime.config.totem.hold (u32) = 180
> runtime.config.totem.join (u32) = 50
> runtime.config.totem.max_messages (u32) = 17
> runtime.config.totem.max_network_delay (u32) = 50
> runtime.config.totem.merge (u32) = 200
> runtime.config.totem.miss_count_const (u32) = 5
> runtime.config.totem.rrp_autorecovery_check_timeout (u32) = 1000
> runtime.config.totem.rrp_problem_count_mcast_threshold (u32) = 100
> runtime.config.totem.rrp_problem_count_threshold (u32) = 10
> runtime.config.totem.rrp_problem_count_timeout (u32) = 2000
> runtime.config.totem.rrp_token_expired_timeout (u32) = 238
> runtime.config.totem.send_join (u32) = 0
> runtime.config.totem.seqno_unchanged_const (u32) = 30
> runtime.config.totem.token (u32) = 1000
> runtime.config.totem.token_retransmit (u32) = 238
> runtime.config.totem.token_retransmits_before_loss_const (u32) = 4
> runtime.config.totem.window_size (u32) = 50
> totem.cluster_name (str) = FRPLZABPXY
> totem.interface.0.bindnetaddr (str) = 10.XX.YY.2
> totem.interface.0.broadcast (str) = yes
> totem.interface.0.mcastport (u16) = 5405
> totem.secauth (str) = off
> totem.totem (str) = 4000
> totem.transport (str) = udpu
> totem.version (u32) = 2
>
>
>