[ClusterLabs] Is "Process pause detected" triggered too easily?

Wed Sep 27 11:39:53 EDT 2017

On Wed, 27 Sep 2017, Jan Friesse wrote:

> I don't think scheduling is the cause. If the scheduler were the cause, 
> another message (Corosync main process was not scheduled for ...) would 
> kick in. This looks more like something is blocked in totemsrp.

Ah, interesting!

> > Also, it looks like the side effect is that corosync drops important
> > messages (I think "join" messages?), and I fear that this can lead to
> 
> You mean membership join messages? Because there are a lot (327) of them 
> in the log you sent.

Yes. In my test setup I didn't see any issue where we lost membership join 
messages, but here is the reason I am looking into this:

We had one problem on a real deployment of DLM+corosync (5 voters and 20 
non-voters, with dlm on those 20, for a specific application that uses 
libdlm). On a reboot of one server running just corosync (which thus did 
NOT run dlm), a large number of other servers got briefly evicted from the 
corosync ring; and when rejoining, dlm complained about a "stateful merge" 
which forces a reboot. Note, dlm fencing is disabled.

In that system, it was "legal" for corosync to kick out these servers 
(they had zero votes), but it was highly unexpected (they were running 
fine) and the impact is high (reboot).
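
To make the voter/non-voter split concrete, here is a rough sketch of how 
such a layout could be expressed in corosync.conf, assuming votequorum with 
a per-node quorum_votes setting. The host names and node IDs are invented 
for illustration; this is not the configuration of the deployment described 
above, and only one node of each kind is shown.

```
quorum {
    provider: corosync_votequorum
}

nodelist {
    # One of the 5 voters (hypothetical host name and node ID)
    node {
        ring0_addr: voter1.example.com
        nodeid: 1
        quorum_votes: 1
    }

    # One of the 20 non-voters running dlm (hypothetical host name and node ID)
    node {
        ring0_addr: worker1.example.com
        nodeid: 6
        quorum_votes: 0
    }
}
```

With a layout like this, losing a zero-vote node does not change quorum, 
which is why evicting such nodes is "legal" from the quorum point of view.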

We did see "Process pause detected" in the logs on that system when the 
incident happened, which is why I think it could be a clue.

> I'll definitely try to reproduce this bug and let you know. I don't 
> think any messages get lost, but it's better to be on the safe side.

Thanks!

Cheers,
JM

-- 
saffroy at gmail.com