[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Tue Aug 29 10:42:27 EDT 2017

Jan Friesse <jfriesse at redhat.com> writes:

> wferi at niif.hu writes:
>
>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>> (in August; in May, it happened 0-2 times a day only, it's slowly
>> ramping up):
>>
>> vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
>> vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
>> vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>
> ^^^ This is the main problem you have to solve. It usually means that
> the machine is too overloaded. It happens quite often when corosync
> is running inside a VM and the host machine is unable to schedule the
> VM regularly.

Hi Honza,

Corosync isn't running in a VM here; these nodes are 2x8-core servers
hosting VMs themselves as Pacemaker resources.  (Incidentally, some of
these VMs run Corosync to form a test cluster, but that should be
irrelevant now.)  And they aren't overloaded in any apparent way: Munin
reports 2900% CPU idle (out of 32 hyperthreads).  There's no swap, but
the corosync process is locked into memory anyway.  It's also running as
SCHED_RR prio 99, competing only with multipathd and the SCHED_FIFO prio
99 kernel threads (migration/* and watchdog/*) under Linux 4.9.  I'll
try to take a closer look at the scheduling of these.  Can you recommend
some indicators to check out?
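
For reference, here is a minimal sketch of one way to look at this from
userspace.  It assumes Linux with Python 3 and a kernel that exposes
/proc/<pid>/schedstat; the helper names (parse_schedstat, describe) are
made up for illustration, and whether run-queue wait time actually
explains the pauses is an open question.

```python
import os

# Map the scheduling policy constants Python exposes on this platform
# to readable names.
POLICY_NAMES = {
    getattr(os, name): name
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR",
                 "SCHED_BATCH", "SCHED_IDLE")
    if hasattr(os, name)
}


def parse_schedstat(text):
    """Parse the three fields of /proc/<pid>/schedstat.

    Fields: time on CPU (ns), time waiting on a run queue (ns),
    number of timeslices run.
    """
    on_cpu_ns, wait_ns, slices = (int(f) for f in text.split()[:3])
    return {
        "on_cpu_ms": on_cpu_ns / 1e6,
        "runqueue_wait_ms": wait_ns / 1e6,
        "timeslices": slices,
    }


def describe(pid):
    """Return (policy name, RT priority, schedstat dict) for a PID."""
    policy = os.sched_getscheduler(pid)
    prio = os.sched_getparam(pid).sched_priority
    with open(f"/proc/{pid}/schedstat") as fh:
        stats = parse_schedstat(fh.read())
    return POLICY_NAMES.get(policy, str(policy)), prio, stats
```

Calling describe() with the corosync PID twice, some seconds apart, and
comparing runqueue_wait_ms would show whether the process spent time
runnable but not running in between.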

Are scheduling delays expected to generate TOTEM membership "changes"
without any nodes leaving or joining?

> As a start you can try what the message says: consider a token
> timeout increase. Currently you have 3 seconds; in theory 6 seconds
> should be enough.

OK, thanks for the tip.  Can I do this on-line, without shutting down
Corosync?
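
A quick back-of-the-envelope check of that suggestion, as a sketch only:
the log shows a 2400 ms threshold with the current 3 second token
timeout, so I'm assuming the warning threshold is a fixed fraction of
the token timeout (0.8, inferred from those two numbers, not taken from
the Corosync source).  The function names are made up for illustration.

```python
# Values taken from the log line and the reply quoted above.
OBSERVED_TOKEN_MS = 3000.0
OBSERVED_THRESHOLD_MS = 2400.0
OBSERVED_PAUSE_MS = 4317.0054

# Assumption: threshold scales linearly with the token timeout.
RATIO = OBSERVED_THRESHOLD_MS / OBSERVED_TOKEN_MS


def pause_threshold_ms(token_ms, ratio=RATIO):
    """Estimated 'not scheduled' warning threshold for a token timeout."""
    return token_ms * ratio


def would_warn(pause_ms, token_ms):
    """True if a pause of pause_ms would exceed the estimated threshold."""
    return pause_ms > pause_threshold_ms(token_ms)


for token_ms in (3000.0, 6000.0):
    print(token_ms, pause_threshold_ms(token_ms),
          would_warn(OBSERVED_PAUSE_MS, token_ms))
```

Under that assumption the observed 4317 ms pause is over the threshold
at 3 seconds but under it at 6 seconds, which matches the suggestion.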
-- 
Thanks,
Feri