[ClusterLabs] Is "Process pause detected" triggered too easily?
Jean-Marc Saffroy
saffroy at gmail.com
Mon Oct 2 21:11:15 CEST 2017
On Mon, 2 Oct 2017, Jan Friesse wrote:
> > We had one problem on a real deployment of DLM+corosync (5 voters and 20
> > non-voters, with dlm on those 20, for a specific application that uses
>
> What you mean by voters and non-voters? There is 25 nodes in total and
> each of them is running corosync?
Yes, there are 25 servers running corosync:
- 5 are configured with one quorum vote each; on these servers corosync
serves no other purpose
- 20 have zero quorum votes; these servers also run DLM and the
application that uses DLM
The intent with this configuration is:
- to avoid split brain in case of a network partition: application servers
must be in the same partition as the quorum majority (i.e. at least 3 of
the 5 "voters") to carry on with their operations
- to allow independent failure of any number of application servers
I hope this makes sense! :)
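For concreteness, here is a trimmed-down sketch of this kind of
corosync.conf (node names, addresses and ids are placeholders, not our
real ones; quorum_votes is the votequorum setting that distinguishes
voters from non-voters):

```
quorum {
    provider: corosync_votequorum
}

nodelist {
    node {
        # one of the 5 quorum voters (placeholder name)
        ring0_addr: voter1
        nodeid: 1
        quorum_votes: 1
    }
    # ... voter2 to voter5 look the same ...
    node {
        # one of the 20 application servers: no quorum vote,
        # runs DLM and the application
        ring0_addr: app1
        nodeid: 6
        quorum_votes: 0
    }
    # ... app2 to app20 look the same ...
}
```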
> > libdlm). On a reboot of one server running just corosync (which thus did
> > NOT run dlm), a large number of other servers got briefly evicted from the
>
> This is kind of weird. AFAIK DLM is joining to CPG group and using CPG
> membership. So if DLM was not running on the node then other nodes joined to
> DLM CPG group should not even notice its leave.
Indeed, but we saw "Process pause detected" on all servers, and corosync
temporarily formed an operational cluster excluding most of the
"non-voters" (those with zero quorum votes). Most servers then rejoined,
at which point DLM complained about the "stateful merge".
> What you mean by zero vote? You mean DLM vote or corosync number of
> votes (related to quorum)?
I mean the vote in the corosync quorum; I'm not aware of anything like
that in DLM (unless you are thinking of the per-server weight used when
one manually defines which servers master locks in a lockspace, but we
don't use that).
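For reference, that per-server weight is the kind of thing set via
"master" lines in dlm.conf; a rough sketch, with a placeholder lockspace
name and node ids (again, we don't use this):

```
# make nodes 1 and 2 preferred lock masters for lockspace "ls1",
# with node 1 carrying twice the weight of node 2
master ls1 node=1 weight=2
master ls1 node=2 weight=1
```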
> I've tried to reproduce the problem and I was not successful with 3
> nodes cluster using more or less default config (not changing
> join/consensus/...). I'll try 5 nodes possibly with totem values and see
> if problem appears.
I tried again today. First, with just 3 servers (VMs) using the same
config I sent earlier (which defines 3 nodes with 1 vote and 2 nodes with
0 votes), I could no longer reproduce the problem either. Then I spawned
2 more VMs and had them join the existing 3-node cluster (the 2 servers
I added were the ones with 0 votes), and I saw the "Process pause ..."
log line. I have now stopped those last 2 servers, so I am back to just
3, yet I keep seeing that log line.
If you're still curious and it would be useful, I can try to reproduce on
a set of VMs to which I could give you full ssh access.
Thanks!
Cheers,
JM
--
saffroy at gmail.com