[ClusterLabs] Cluster node loss detection.

Vallevand, Mark K Mark.Vallevand at UNISYS.com
Fri Oct 16 15:40:47 UTC 2015


Thanks.  I wasn't completely aware of corosync's role in this.  I see new things in the docs every time I read them.

I had looked up the corosync settings once before, and I checked them again:
	token loss 3000ms
	retransmits 10
So 30s, by my arithmetic (3000ms x 10).  I redid my simple testing and got detection times of 22s, 26s, and 25s using very crude methods.
Are there any caveats about setting these values to something else?
We require our customers to use an isolated, private network for cluster communications.  That is all taken care of in our instructions and cluster configuration scripts, so network traffic will not be a factor.  I'm thinking of 1000ms and 5 retransmits as an experiment.
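
To make the arithmetic explicit, here is a small Python sketch of the back-of-the-envelope estimate used above: the token timeout multiplied by the retransmit count. The function name `naive_detection_ms` is made up for illustration, and the multiplication itself is an assumption rather than corosync's documented formula. Other totem timers may also contribute, so treat the result as a rough figure only.

```python
def naive_detection_ms(token_ms: int, retransmits: int) -> int:
    """Rough estimate only: token timeout times retransmit count.

    This mirrors the multiplication used in the message above. It is an
    assumption for illustration, not corosync's documented formula.
    """
    if token_ms <= 0 or retransmits <= 0:
        raise ValueError("token_ms and retransmits must be positive")
    return token_ms * retransmits


if __name__ == "__main__":
    current = naive_detection_ms(3000, 10)   # settings currently in place
    proposed = naive_detection_ms(1000, 5)   # values proposed as an experiment
    print(f"current estimate:  {current / 1000:.0f}s")
    print(f"proposed estimate: {proposed / 1000:.0f}s")
```

The measured times of 22s to 26s came in under the 30s estimate, so the simple product is probably not the whole story. Any new values would need to be confirmed by testing.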

I was pretty sure that DLM was just being informed by clustering, but I needed to ask.

Again, thanks.
	

Regards.
Mark K Vallevand   Mark.Vallevand at Unisys.com <mailto:Mark.Vallevand at Unisys.com> 
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.

THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY MATERIAL and is thus for use only by the intended recipient. If you received this in error, please contact the sender and delete the e-mail and its attachments from all computers.


-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca] 
Sent: Friday, October 16, 2015 10:04 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Cluster node loss detection.

On 16/10/15 10:51 AM, Vallevand, Mark K wrote:
> It looks like it takes 20s for a cluster to detect that a node has been
> lost.

Loss is detected by corosync, and it declares loss after X lost totem
tokens, each token being declared lost after Y milliseconds. By default,
node loss should be detected in about 1 second of no network traffic,
but you need to check corosync's settings.

> The detection seems to correlate to dlm reporting its lost connection to
> the node.

Negative. DLM is informed when a node is declared lost and blocks until
fenced/stonithd tells it that the peer has been successfully fenced.
After which time, it reaps lost locks and recovers.

> Not sure if correlation is causation.

Correlation.

> Anyway, can someone tell me where that 20s might be coming from and if
> it is adjustable? 
> 
> Ubuntu 12.04 LTS
> pacemaker 1.1.10
>  cman 3.1.7
> corosync 1.4.6
> 
> Thanks!
> 
>  
> 
> Regards.
> Mark K Vallevand   Mark.Vallevand at Unisys.com
> <mailto:Mark.Vallevand at Unisys.com>
> Never try and teach a pig to sing: it's a waste of time, and it annoys
> the pig.
> 
> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY
> MATERIAL and is thus for use only by the intended recipient. If you
> received this in error, please contact the sender and delete the e-mail
> and its attachments from all computers.

This suffix has zero legal bearing, just saying. Anything posted to this
list is 100% open and public.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
