[ClusterLabs] Is "Process pause detected" triggered too easily?

Jan Friesse jfriesse at redhat.com
Wed Oct 4 07:45:57 UTC 2017


Jean,

> Hi Jan,
>
> On Tue, 3 Oct 2017, Jan Friesse wrote:
>
>>> I hope this makes sense! :)
>>
>> I would still have some questions :) but that is really not related to
>> the problem you have.
>
> Questions are welcome! I am new to this stack, so there is certainly room
> for learning and for improvement.
>
>> My personal favorite is the consensus timeout. You've set (and I must
>> say correctly, according to the doc) the consensus timeout to 3600
>> (= 1.2 * token). The problem is that the resulting token timeout is not
>> 3000; with 5 nodes it is actually 3000 (base token) + (no_nodes - 2) *
>> 650 ms = 4950 (as you can check by observing the
>> runtime.config.totem.token key). So it may make sense to set the
>> consensus timeout to ~6000.
>
> Could you clarify the formula for me? I don't see how "- 2" and "650" map
> to this configuration.

Since Corosync 2.3.4, when a nodelist is used, totem.token serves only as
the basis for calculating the real token timeout. See the corosync.conf
man page for more information and the formula.
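
For illustration, here is a minimal sketch of that calculation. It assumes
the formula documented under token_coefficient in corosync.conf(5):
real token = token + (number_of_nodes - 2) * token_coefficient, applied
only when the nodelist has at least 3 nodes, with token_coefficient
defaulting to 650 ms. That is where the "- 2" and "650" come from. The
function name is hypothetical and not part of Corosync.

```python
def effective_token_ms(base_token_ms: int, node_count: int,
                       token_coefficient_ms: int = 650) -> int:
    """Estimate the runtime token timeout (ms) when a nodelist is used.

    Assumption: the coefficient only applies with 3 or more nodes;
    650 ms is assumed to be the default token_coefficient.
    """
    if node_count < 3:
        # With fewer than 3 nodes the base token is used unchanged.
        return base_token_ms
    return base_token_ms + (node_count - 2) * token_coefficient_ms


# The 5-node case from this thread: 3000 + (5 - 2) * 650
print(effective_token_ms(3000, 5))  # 4950
```

The result should match what the runtime.config.totem.token key reports
on a running cluster.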

>
> And I suppose that on our bigger system (20+5 servers) we need to greatly
> increase the consensus timeout.

The consensus timeout follows the token value: if it is not defined in the
config file, it is computed as token * 1.2. This is not reflected in the
man page and needs to be fixed.
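
As a rough sketch of that default (assuming "token" here means the real,
computed token timeout rather than the base value from the config file,
and that the result is truncated to whole milliseconds; the function name
is hypothetical):

```python
def default_consensus_ms(effective_token_ms: int) -> int:
    """Estimate the consensus timeout (ms) used when none is configured.

    Assumption: consensus = 1.2 * real token timeout. Integer arithmetic
    (12 // 10) is used here to avoid floating-point rounding surprises.
    """
    return effective_token_ms * 12 // 10


# Real token timeout for the 5-node cluster in this thread is 4950 ms.
print(default_consensus_ms(4950))  # 5940

# Using the base token instead gives the value that was configured.
print(default_consensus_ms(3000))  # 3600
```

This is why a consensus of 3600 is too low for 5 nodes and something
close to 6000 is a better fit.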

>
> Overall, tuning the timeouts seems to be related to Black Magic. ;) I liked

It is.

> the idea suggested in an old thread that there would be a spreadsheet (or
> even just plain formulas) exposing the relation between the various knobs.

The idea is to compute it directly in the code. This is implemented for
some parts, but sadly not for others. The reason is mostly that it is quite
hard to get these timeouts right, so that failure detection is fast enough
but there are as few false membership changes as possible.

>
> One thing I wonder is: would it make sense to annotate the state machine
> diagram in the Totem paper (page 15 of
> http://www.cs.jhu.edu/~yairamir/tocs.ps.gz) with those tunables? Assuming
> the paper still reflects the behavior of the current code.

Yes, the code reflects the paper (to some extent; some things are slightly
different), and I really like the idea of annotating it, or actually having
a wiki page with this diagram and some light documentation of the totemsrp
internals.

>
>> This doesn't change the fact that "bug" is reproducible even with
>> "correct" consensus, so I will continue working on this issue.
>
> Great! Thanks for taking the time to investigate.

Yep, np.

Honza

>
>
> Cheers,
> JM
>




