[ClusterLabs] Is "Process pause detected" triggered too easily?

Jan Friesse jfriesse at redhat.com
Tue Oct 3 12:00:59 EDT 2017


Jean,

> On Mon, 2 Oct 2017, Jan Friesse wrote:
>
>>> We had one problem on a real deployment of DLM+corosync (5 voters and 20
>>> non-voters, with dlm on those 20, for a specific application that uses
>>
>> What do you mean by voters and non-voters? There are 25 nodes in total
>> and each of them is running corosync?
>
> Yes, there are 25 servers running corosync:
>
> - 5 are configured to have one vote for quorum; on these servers corosync
> serves no other purpose
>
> - 20 have zero votes for quorum, and these servers also run DLM and the
> application that uses DLM
>
> The intent with this configuration is:
>
> - to avoid split brain in case of network partition: application servers
> must be in the same partition as the quorum majority (so, 3 of the 5
> "voters") to carry on their operations
>
> - to allow independent failure of any number of application servers
>
> I hope this makes sense! :)

I would still have some questions :) but they are really not related to 
the problem you have.
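
(Just to make sure we are talking about the same thing: I assume the 
zero-vote nodes are defined roughly like this in the nodelist; the names 
and nodeids below are made up:

   nodelist {
       # one of the 5 "voters"
       node {
           ring0_addr: voter1
           nodeid: 1
           quorum_votes: 1
       }
       # one of the 20 DLM/application servers
       node {
           ring0_addr: appserver1
           nodeid: 6
           quorum_votes: 0
       }
       # ... remaining voter and zero-vote nodes
   }
)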

>
>>> libdlm). On a reboot of one server running just corosync (which thus did
>>> NOT run dlm), a large number of other servers got briefly evicted from the
>>
>> This is kind of weird. AFAIK DLM joins a CPG group and uses CPG
>> membership. So if DLM was not running on the node, then the other nodes
>> joined to the DLM CPG group should not even notice it leaving.
>
> Indeed, but we saw "Process pause detected" on all servers, and corosync
> temporarily formed an operational cluster excluding most of the
> "non-voters" (those with zero quorum vote). Then most servers joined back,
> but then DLM complained about the "stateful merge".

Yep, when this shows up on all servers it's a big problem, and it explains a lot.

>
>> What do you mean by zero vote? Do you mean a DLM vote or the corosync
>> number of votes (related to quorum)?
>
> I mean the vote in the corosync quorum; I'm not aware of anything like
> that with DLM (or maybe you are thinking of the per-server weight used
> when one manually defines which servers master locks in a lock space,
> but we don't use that).


Got it.

>
>> I've tried to reproduce the problem and I was not successful with a
>> 3-node cluster using a more or less default config (not changing
>> join/consensus/...). I'll try 5 nodes, possibly with changed totem
>> values, and see if the problem appears.
>
> I've tried again today, and first with just 3 servers (VMs), using the
> same config I sent earlier (which has 3 nodes with 1 vote, 2 nodes with 0
> vote), I could no longer reproduce either. Then I spawned 2 more VMs and
> had them join the existing 3-node cluster (those I added were the 2
> servers with 0 vote), and then I saw the "Process pause ..." log line. And
> now I have stopped the last 2 servers, and I am back to just 3, and I keep
> seeing that log line.
>
> If you're still curious and if that's useful, I can try to reproduce on a
> set of VMs where I could give you full ssh access.

So the good news is that I was able to reproduce it. Even better news is 
that I was able to reproduce it even without changing the 
join/consensus/... parameters. What's even better is that with those 
parameters changed it becomes much easier to reproduce the problem. So in 
theory, if I can identify the relevant parameter, it may make sense to 
increase/decrease it close to infinity/0, and debugging should then 
become easy.

My personal favorite is the consensus timeout, because you've set (and I 
must say correctly, according to the docs) the consensus timeout to 3600 
(= 1.2 * token). The problem is that the resulting token timeout is not 
3000: with 5 nodes it is actually 3000 (base token) + (no_nodes - 2) * 
650 ms = 4950 (as you can check by observing the 
runtime.config.totem.token key). So it may make sense to set the 
consensus timeout to ~6000.
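
For reference, you can verify the effective value and bump consensus 
roughly like this (just a sketch; AFAIK the 650 ms per-node increment 
comes from the token_coefficient default):

   # show the token timeout corosync actually computed at runtime
   corosync-cmapctl | grep runtime.config.totem.token

   # corosync.conf, totem section
   totem {
       token: 3000
       # ~1.2 * the effective token (4950) instead of 1.2 * the base token
       consensus: 6000
   }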

This doesn't change the fact that the "bug" is reproducible even with a 
"correct" consensus, so I will continue working on this issue.

Honza


>
>
> Thanks!
>
> Cheers,
> JM
>




