[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Tue Aug 29 11:06:47 EDT 2017

Ferenc,

> Jan Friesse <jfriesse at redhat.com> writes:
>
>> wferi at niif.hu writes:
>>
>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>> ramping up):
>>>
>>> vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
>>> vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
>>> vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>>
>> ^^^ This is main problem you have to solve. It usually means that
>> machine is too overloaded. It is happening quite often when corosync
>> is running inside VM where host machine is unable to schedule regular
>> VM running.
>
> Hi Honza,
>
> Corosync isn't running in a VM here, these nodes are 2x8 core servers
> hosting VMs themselves as Pacemaker resources.  (Incidentally, some of
> these VMs run Corosync to form a test cluster, but that should be
> irrelevant now.)  And they aren't overloaded in any apparent way: Munin
> reports 2900% CPU idle (out of 32 hyperthreads).  There's no swap, but
> the corosync process is locked into memory anyway.  It's also running as
> SCHED_RR prio 99, competing only with multipathd and the SCHED_FIFO prio
> 99 kernel threads (migration/* and watchdog/*) under Linux 4.9.  I'll
> try to take a closer look at the scheduling of these.  Can you recommend
> some indicators to check out?

No real hints, but one question: are you 100% sure the memory is locked?
We had a problem where mlockall was called in the wrong place, so
corosync was not actually locked, and that caused similar issues.

This behavior is fixed by
https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
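
One way to check whether the running process really has locked memory is
to look at the VmLck field of /proc/<pid>/status on Linux. Below is a
minimal sketch of that check. The helper names and the sample status
text are made up for illustration (they are not output from this
cluster). A VmLck of 0 kB would mean nothing is locked.

```python
def parse_proc_status_kb(status_text):
    """Return the kB-valued fields of a /proc/<pid>/status dump as a dict."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        parts = value.split()
        # Keep only lines of the form "Key:   <number> kB".
        if sep and len(parts) == 2 and parts[1] == "kB":
            fields[key.strip()] = int(parts[0])
    return fields


def locked_memory_kb(pid):
    """Read VmLck (locked memory, in kB) for a live process on Linux."""
    with open(f"/proc/{pid}/status") as f:
        return parse_proc_status_kb(f.read()).get("VmLck", 0)


# Hypothetical sample for illustration only, not taken from this thread.
SAMPLE_STATUS = """\
Name:\tcorosync
VmLck:\t  123456 kB
VmRSS:\t  123456 kB
Threads:\t3
"""

sample = parse_proc_status_kb(SAMPLE_STATUS)
print(sample["VmLck"], sample["VmRSS"])
```

On a node, something like locked_memory_kb(3805) (the corosync PID from
the vhbl07 log line) would read the live value.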

>
> Are scheduling delays expected to generate TOTEM membership "changes"
> without any leaving and joining nodes?

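Yes, they are. If corosync is not scheduled for longer than the token
timeout, the other nodes stop seeing the token and start forming a new
configuration.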

>
>> As a start you can try what message say = Consider token timeout
>> increase. Currently you have 3 seconds, in theory 6 second should be
>> enough.
>
> OK, thanks for the tip.  Can I do this on-line, without shutting down
> Corosync?
>

The Corosync way is to edit corosync.conf, copy it to all nodes, and then
call corosync-cfgtool -R on one of the nodes (crmsh/pcs may have a better
way).
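
As for why 6 seconds should be enough, here is a rough arithmetic sketch
using only the numbers from the log line above. It assumes the "not
scheduled" threshold scales linearly with the token timeout, which is
inferred from 2400 ms against the current 3 seconds rather than taken
from the documentation. The function name is made up for illustration.
The token value in corosync.conf is given in milliseconds.

```python
# Values copied from the vhbl07 log line and the discussion in this thread.
observed_pause_ms = 4317.0054
reported_threshold_ms = 2400.0
current_token_ms = 3000.0    # current token timeout (3 seconds)
proposed_token_ms = 6000.0   # suggested token timeout (6 seconds)

# Assumption: the warning threshold is a fixed fraction of the token timeout.
ratio = reported_threshold_ms / current_token_ms


def pause_threshold_ms(token_ms, ratio=ratio):
    """Estimated scheduling-pause threshold for a given token timeout."""
    return token_ms * ratio


proposed_threshold_ms = pause_threshold_ms(proposed_token_ms)

# The observed pause exceeds the current token timeout...
print(observed_pause_ms > current_token_ms)
# ...but stays below the estimated threshold for the proposed one.
print(observed_pause_ms < proposed_threshold_ms)
```

Under that assumption, the 4317 ms pause is longer than the current
token timeout but shorter than both the proposed timeout and its
estimated threshold.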

Regards,
   Honza