[ClusterLabs] Is "Process pause detected" triggered too easily?

Tue Oct 3 21:57:43 CEST 2017

Hi Jan,

On Tue, 3 Oct 2017, Jan Friesse wrote:

> > I hope this makes sense! :)
> 
> I would still have some questions :) but that is really not related to 
> the problem you have.

Questions are welcome! I am new to this stack, so there is certainly room 
for learning and for improvement.

> My personal favorite is consensus timeout. Because you've set (and I 
> must say according to doc correctly) consensus timeout to 3600 (= 1.2 * 
> token). Problem is, that result token timeout is not 3000, but with 5 
> nodes it is actually 3000 (base token) + (no_nodes - 2) * 650 ms = 4950 
> (as you can check by observing runtime.config.totem.token key). So it 
> may make sense to set consensus timeout to ~6000.

Could you clarify the formula for me? I don't see how "- 2" and "650" map 
to this configuration.

And I suppose that on our bigger system (20+5 servers) we need to greatly 
increase the consensus timeout.
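
For what it's worth, taking the formula above at face value, the
arithmetic can be sketched as below. This is only an illustration: the
function and constant names are made up, the 650 ms and the "- 2" are
copied from the quoted message, 1.2 is the consensus/token ratio from the
doc, and counting all 20+5 servers as 25 nodes is an assumption.

```python
# Sketch of the timeout arithmetic quoted above; all names are hypothetical.
TOKEN_COEFFICIENT_MS = 650  # per-node increment, as quoted in the message


def effective_token_ms(base_token_ms, node_count):
    """Base token plus 650 ms for every node beyond the second."""
    # Assumption: with fewer than 3 nodes nothing is added to the base token.
    extra_nodes = max(0, node_count - 2)
    return base_token_ms + extra_nodes * TOKEN_COEFFICIENT_MS


def consensus_ms(token_ms):
    """1.2 * token, done in integer arithmetic to avoid float rounding."""
    return token_ms * 6 // 5


for nodes in (5, 25):
    token = effective_token_ms(3000, nodes)
    print(nodes, token, consensus_ms(token))
```

With a 3000 ms base token this gives 4950 ms for 5 nodes (consensus 5940,
so ~6000 as suggested) and 17950 ms for 25 nodes (consensus 21540).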

Overall, tuning the timeouts seems to be Black Magic. ;) I liked 
the idea suggested in an old thread that there would be a spreadsheet (or 
even just plain formulas) exposing the relation between the various knobs.

One thing I wonder: would it make sense to annotate the state machine 
diagram in the Totem paper (page 15 of 
http://www.cs.jhu.edu/~yairamir/tocs.ps.gz) with those tunables, assuming 
the paper still reflects the behavior of the current code?

> This doesn't change the fact that "bug" is reproducible even with 
> "correct" consensus, so I will continue working on this issue.

Great! Thanks for taking the time to investigate.

Cheers,
JM

-- 
saffroy at gmail.com