[ClusterLabs] Cluster node loss detection.
Vallevand, Mark K
Mark.Vallevand at UNISYS.com
Fri Oct 16 12:37:36 EDT 2015
Fencing, yes. I have pcmk-redirect for each node in cluster.conf.
I run with default cman settings for corosync. No totem clause. That gives the 20s detection. Not sure what the defaults really are.
I added <totem token="1000" token_retransmits_before_loss_const="5" /> to cluster.conf and get about a 5s detection.
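For anyone following along, a minimal sketch of where that element sits in cluster.conf. It assumes the usual cman-plus-Pacemaker layout with fencing redirected through fence_pcmk; the cluster name, node name, nodeid, config_version and the device name "pcmk" are placeholders for illustration, not values taken from the configuration discussed here.

```xml
<?xml version="1.0"?>
<!-- Hypothetical example: names and version numbers are placeholders. -->
<cluster name="examplecluster" config_version="2">
  <!-- The totem element is a direct child of <cluster>; cman hands
       these attributes on to corosync. Values are the ones tried above. -->
  <totem token="1000" token_retransmits_before_loss_const="5"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1">
      <fence>
        <!-- pcmk-redirect passes fencing requests to Pacemaker -->
        <method name="pcmk-redirect">
          <device name="pcmk" port="node1"/>
        </method>
      </fence>
    </clusternode>
    <!-- repeat one clusternode block per node -->
  </clusternodes>
  <fencedevices>
    <fencedevice name="pcmk" agent="fence_pcmk"/>
  </fencedevices>
</cluster>
```

After editing, config_version normally has to be incremented and the file validated and distributed to every node before the new totem values take effect.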
The corosync man page says:
token
    This timeout specifies in milliseconds until a token loss is declared after not receiving a token. This is the time spent detecting a failure of a processor in the current configuration. Reforming a new configuration takes about 50 milliseconds in addition to this timeout.
    The default is 1000 milliseconds.
token_retransmit
    This timeout specifies in milliseconds after how long before receiving a token the token is retransmitted. This will be automatically calculated if token is modified. It is not recommended to alter this value without guidance from the corosync community.
    The default is 238 milliseconds.
hold
    This timeout specifies in milliseconds how long the token should be held by the representative when the protocol is under low utilization. It is not recommended to alter this value without guidance from the corosync community.
    The default is 180 milliseconds.
token_retransmits_before_loss_const
    This value identifies how many token retransmits should be attempted before forming a new configuration. If this value is set, retransmit and hold will be automatically calculated from retransmits_before_loss and token.
    The default is 4 retransmissions.
But I don't know what cman sets these to. They aren't these values, and they aren't the values in the cman man page either, which says this:
Cman uses different defaults for some of the corosync parameters listed in corosync.conf(5). If you wish to use a non-default setting, they can be configured in cluster.conf as shown above. Cman uses the following default values:
<!-- or rrp_mode="active" if altnames are present >
So, it looks like setting the corosync parameters in cluster.conf has some effect. Cman seems to pass them to corosync.
Mark K Vallevand Mark.Vallevand at Unisys.com
Never try and teach a pig to sing: it's a waste of time, and it annoys the pig.
From: Digimer [mailto:lists at alteeve.ca]
Sent: Friday, October 16, 2015 11:18 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Cluster node loss detection.
On 16/10/15 11:40 AM, Vallevand, Mark K wrote:
> Thanks. I wasn't completely aware of corosync's role in this. I see new things in the docs every time I read them.
> I looked up the corosync settings at one time and did it again:
> token loss 3000ms
> retransmits 10
> So 30s. Redid my simple testing and got detection times of 22s, 26s, and 25s using very crude methods.
> Any warnings about setting these values to something else?
> We require our customers to use an isolated, private network for cluster communications. All taken care of in our instructions and cluster configuration scripts. Network traffic will not be a factor. So, I'm thinking 1000ms and 5 retransmits as an experiment.
That is very high. I think the default is something like 236ms x 4 losses.
You do have fencing, right?
> I was pretty sure that DLM was just being informed by clustering, but I needed to ask.
> Again, thanks.
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?