[Pacemaker] cluster-delay property

Thu Oct 24 10:12:51 EDT 2013

On Thursday, 24 October 2013, 10:06:20 you wrote:
> On 24/10/13 09:01, Michael Schwartzkopff wrote:
> > On Thursday, 24 October 2013, 14:39:39 Karl Rößmann wrote:
> >> Sorry, I will try to explain
> >> 
> >> Hi
> >> 
> >> In your book you describe a parameter 'deadtime' which defines
> >> the timeout to declare a node as dead. I want to extend this
> >> value to 120s to avoid such a scenario
> >> 
> >> But: in the SuSE documentation I cannot find 'deadtime', instead
> >> I see a value 'cluster-delay'. My question is: are these two
> >> parameters equivalent?
> >> 
> >> More details about the scenario: The I/O load was created by me,
> >> because I copied a large Xen image to a logical volume of the
> >> cLVM (using 'dd'). I did it several times before without
> >> problems. Maybe something changed after upgrading to SLES SP3.
> >> 
> >> One node, (it was the DC) died, the Xen resources went to the
> >> surviving node. Fine.
> >> 
> >> No information in the log file.
> >> 
> >> On the surviving node I see: Oct 23 09:30:41 ha2infra
> >> corosync[9085]:  [TOTEM ] A processor failed, forming new
> >> configuration.
> > 
> > (...)
> > 
> > the log says that corosync did not see the node. This is not a
> > pacemaker problem.
> > 
> > I speculate that this happened because one node was heavily
> > overloaded doing the dd and did not manage to process the corosync
> > tokens in time. Or perhaps the load on the network was so high that
> > corosync packets were dropped.
> > 
> > Anyway: This is not a pacemaker problem, it is a corosync problem.
> > 
> > If you want to make corosync behave a little bit more relaxed
> > please see "man corosync.conf" for the options. Look for the
> > options token and the following options. I don't know what options
> > are available in SLES11 HAE3. corosync is under heavy improvement
> > ;-)
> > 
> > If you have a question for a specific option please ask here on the
> > list.
> 
> I agree with Michael that this is a corosync problem. I also agree
> that this is a congestion problem. The variable you are looking for is
> token_retransmit, if I am correct.
> 
> I would argue that the better solution is not to adjust this value,
> but to fix your architecture to separate corosync/pacemaker traffic
> from the disk/dd traffic. If you increase token_retransmit, you will
> delay how long real failures take to be detected, thus slowing down
> recovery.

Of course, fiddling around with the token_retransmit option doesn't solve the
problem. It just treats the symptoms.

Perhaps you could limit the transfer rate of dd. Search for "dd rate limit";
there are several solutions. rsync/csync could also be an option.
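
To illustrate what "rate limiting the copy" means, here is a minimal sketch in
Python. It is not one of the solutions referred to above, just a hypothetical
helper (the name throttled_copy and its parameters are my own) that copies a
file in chunks and sleeps so that the average throughput stays at or below a
given limit. It assumes a regular destination file; it does not force data out
of the page cache, so the actual disk writes may still arrive in bursts.

```python
import time


def throttled_copy(src_path, dst_path, max_bytes_per_sec, chunk_size=1024 * 1024):
    """Copy src_path to dst_path, keeping the average rate <= max_bytes_per_sec.

    Hypothetical helper for illustration only. Returns the number of bytes copied.
    """
    copied = 0
    start = time.monotonic()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of input
            dst.write(chunk)
            copied += len(chunk)
            # Earliest point in time at which 'copied' bytes may have been written
            # without exceeding the configured average rate.
            earliest = start + copied / max_bytes_per_sec
            delay = earliest - time.monotonic()
            if delay > 0:
                time.sleep(delay)
    return copied
```

For example, throttled_copy("image.img", "copy.img", 20 * 1024 * 1024) would
copy at roughly 20 MiB/s at most (both file names are placeholders). Smaller
chunks give a smoother rate at the cost of more system calls.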

Also you could think about improving your disk I/O sub-system.

But you had better find out what the bottleneck in your system is and how to
solve it.

Kind regards,

Michael Schwartzkopff

-- 
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
Franziskanerstraße 15, 81669 München

Registered office: München, District Court München: HRB 199263
Executive board: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
Chairman of the supervisory board: Florian Kirstein