[ClusterLabs] Is "Process pause detected" triggered too easily?

Jan Friesse jfriesse at redhat.com
Mon Oct 2 17:37:47 CEST 2017


> On Wed, 27 Sep 2017, Jan Friesse wrote:
>
>> I don't think scheduling is the case. If the scheduler were the cause,
>> the other message ("Corosync main process was not scheduled for ...")
>> would kick in. This looks more like something is blocked in totemsrp.
>
> Ah, interesting!
>
>>> Also, it looks like the side effect is that corosync drops important
>>> messages (I think "join" messages?), and I fear that this can lead to
>>
>> Do you mean membership join messages? There are a lot of them (327) in
>> the log you've sent.
>
> Yes. In my test setup I didn't see any issue where we lost membership join
> messages, but the reason I am looking into this is the following:
>
> We had one problem on a real deployment of DLM+corosync (5 voters and 20
> non-voters, with dlm on those 20, for a specific application that uses

What do you mean by voters and non-voters? Are there 25 nodes in total, 
each of them running corosync?

> libdlm). On a reboot of one server running just corosync (which thus did
> NOT run dlm), a large number of other servers got briefly evicted from the

This is kind of weird. AFAIK DLM joins a CPG group and uses CPG 
membership. So if DLM was not running on the node, the other nodes 
joined to the DLM CPG group should not even notice it leaving.
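
For illustration, a minimal (untested) sketch using libcpg of what I 
mean; the group name "dlm_example" is made up. Only processes that 
actually called cpg_join() for the group show up in the left_list of the 
configuration-change callback, so a corosync-only node should never 
appear there:

#include <stdio.h>
#include <string.h>
#include <corosync/cpg.h>

static void confchg_cb(cpg_handle_t handle,
        const struct cpg_name *group_name,
        const struct cpg_address *member_list, size_t member_list_entries,
        const struct cpg_address *left_list, size_t left_list_entries,
        const struct cpg_address *joined_list, size_t joined_list_entries)
{
        size_t i;

        /* Only leaves of processes that joined this group are delivered */
        for (i = 0; i < left_list_entries; i++)
                printf("nodeid %u pid %u left group %s\n",
                       left_list[i].nodeid, left_list[i].pid,
                       group_name->value);
}

static cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = NULL,
        .cpg_confchg_fn = confchg_cb,
};

int main(void)
{
        cpg_handle_t handle;
        struct cpg_name group;

        strcpy(group.value, "dlm_example");  /* hypothetical group name */
        group.length = strlen(group.value);

        if (cpg_initialize(&handle, &callbacks) != CS_OK)
                return 1;
        if (cpg_join(handle, &group) != CS_OK)
                return 1;

        /* Block and run the callbacks as membership changes arrive */
        cpg_dispatch(handle, CS_DISPATCH_BLOCKING);

        cpg_finalize(handle);
        return 0;
}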

> corosync ring; and when rejoining, dlm complained about a "stateful merge",
> which forces a reboot. Note that dlm fencing is disabled.
>
> In that system, it was "legal" for corosync to kick out these servers
> (they had zero votes), but it was highly unexpected (they were running
> fine) and the impact was high (a reboot).

What do you mean by zero votes? Do you mean DLM votes or the corosync 
number of votes (related to quorum)?
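
If you mean corosync quorum votes, a hypothetical corosync.conf fragment 
(node names and ids made up) could look like this:

nodelist {
    node {
        ring0_addr: node1
        nodeid: 1
        quorum_votes: 1    # voter
    }
    node {
        ring0_addr: node6
        nodeid: 6
        quorum_votes: 0    # non-voter, but still a full totem member
    }
}

Note that a zero-vote node does not count towards quorum, but it is 
still a full member of the totem ring, so totem membership changes can 
still "kick it out".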

>
> We did see "Process pause detected" in the logs on that system when the
> incident happened, which is why I think it could be a clue.

I've tried to reproduce the problem and was not successful with a 3-node 
cluster using a more or less default config (not changing 
join/consensus/...). I'll try 5 nodes, possibly with tuned totem values, 
and see if the problem appears.
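
For reference, by totem values I mean tuning keys like these in 
corosync.conf; the values shown are just the documented defaults from 
corosync.conf(5), not a recommendation:

totem {
    version: 2
    token: 1000        # token timeout, in milliseconds
    join: 50           # join timeout, in milliseconds
    consensus: 1200    # defaults to 1.2 * token
}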

Regards,
   Honza

>
>> I'll definitely try to reproduce this bug and let you know. I don't
>> think any messages get lost, but it's better to be on the safe side.
>
> Thanks!
>
>
> Cheers,
> JM
>



