[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Wed Aug 30 02:54:14 EDT 2017

Ferenc,

> Jan Friesse <jfriesse at redhat.com> writes:
>
>> wferi at niif.hu writes:
>>
>>> Jan Friesse <jfriesse at redhat.com> writes:
>>>
>>>> wferi at niif.hu writes:
>>>>
>>>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>>>> ramping up):
>>>>>
>>>>> vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
>>>>> vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
>>>>> vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>>>>
>>>> ^^^ This is the main problem you have to solve. It usually means that
>>>> the machine is overloaded. It happens quite often when corosync runs
>>>> inside a VM and the host machine is unable to schedule the VM
>>>> regularly.
>>>
>>> Corosync isn't running in a VM here, these nodes are 2x8 core servers
>>> hosting VMs themselves as Pacemaker resources.  (Incidentally, some of
>>> these VMs run Corosync to form a test cluster, but that should be
>>> irrelevant now.)  And they aren't overloaded in any apparent way: Munin
>>> reports 2900% CPU idle (out of 32 hyperthreads).  There's no swap, but
>>> the corosync process is locked into memory anyway.  It's also running as
>>> SCHED_RR prio 99, competing only with multipathd and the SCHED_FIFO prio
>>> 99 kernel threads (migration/* and watchdog/*) under Linux 4.9.  I'll
>>> try to take a closer look at the scheduling of these.  Can you recommend
>>> some indicators to check out?
>>
>> No real hints, but one question: are you 100% sure the memory is locked?
>> We had a problem where mlockall was called in the wrong place, so
>> corosync was not actually locked, and that caused similar issues.
>>
>> This behavior is fixed by
>> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
>
> I based this assertion on the L flag in the ps STAT column.  The above
> commit should not affect me because I'm running corosync with the -f
> option:

Oh, OK. If you are running with -f, then the bug above doesn't affect you.

>
> $ ps l 3805
> F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
> 4     0  3805     1 -100  - 247464 141016 -     SLsl ?        251:10 /usr/sbin/corosync -f
>
> By the way, are the above VSZ and RSS numbers reasonable?

Yep, perfectly reasonable.
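
The L flag in the STAT column can be cross-checked against the VmLck field
of /proc/<pid>/status. The following is only a minimal sketch: the helper
names (locked_memory_kb, describe_stat, read_status) are made up for
illustration, and the flag descriptions are paraphrased from the ps(1) man
page.

```python
# Hypothetical helpers, not part of corosync or procps.

# Process state letters and extra flags as documented in ps(1).
STATES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "Z": "zombie",
}
FLAGS = {
    "<": "high-priority",
    "N": "low-priority",
    "L": "has pages locked into memory",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def describe_stat(stat):
    """Turn a ps STAT string such as 'SLsl' into readable descriptions."""
    if not stat:
        return []
    result = [STATES.get(stat[0], "unknown state " + stat[0])]
    for flag in stat[1:]:
        result.append(FLAGS.get(flag, "unknown flag " + flag))
    return result


def locked_memory_kb(status_text):
    """Return the VmLck value (in kB) from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmLck:"):
            return int(line.split()[1])
    return None


def read_status(pid):
    """Read /proc/<pid>/status (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return f.read()
```

For example, describe_stat("SLsl") lists the meaning of each letter, and
locked_memory_kb(read_status(3805)) would report how much of the process is
locked. If mlockall took effect, VmLck should be non-zero and of roughly
the same size as the resident set.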

Regards,
   Honza

>
> One more thing: these servers run without any swap.
>
>>>> As a start, you can try what the message says: consider increasing
>>>> the token timeout. Currently you have 3 seconds; in theory, 6 seconds
>>>> should be enough.
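
To relate the log message to the token timeout, here is a minimal sketch
that parses the "not scheduled" line and checks a pause against a candidate
timeout. It assumes the warning threshold is 80% of the token timeout; that
is an assumption, although it matches the numbers in this thread (2400 ms
threshold, 3 second token). All function names are hypothetical.

```python
import re

# Matches the "[MAIN  ] Corosync main process was not scheduled ..." message.
PAUSE_RE = re.compile(
    r"not scheduled for (?P<pause>[\d.]+) ms "
    r"\(threshold is (?P<threshold>[\d.]+) ms\)"
)

# Assumption: the warning threshold is 80% of the token timeout.
THRESHOLD_FRACTION = 0.8


def parse_pause(line):
    """Return (pause_ms, threshold_ms) from a log line, or None if no match."""
    m = PAUSE_RE.search(line)
    if m is None:
        return None
    return float(m.group("pause")), float(m.group("threshold"))


def implied_token_ms(threshold_ms):
    """Token timeout implied by a logged threshold, under the 80% assumption."""
    return threshold_ms / THRESHOLD_FRACTION


def would_warn(pause_ms, token_ms):
    """True if a pause of pause_ms exceeds the threshold for token_ms."""
    return pause_ms > token_ms * THRESHOLD_FRACTION


if __name__ == "__main__":
    line = ("vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not "
            "scheduled for 4317.0054 ms (threshold is 2400.0000 ms). "
            "Consider token timeout increase.")
    pause, threshold = parse_pause(line)
    print("implied token timeout (ms):", implied_token_ms(threshold))
    print("warns with 3000 ms token:", would_warn(pause, 3000))
    print("warns with 6000 ms token:", would_warn(pause, 6000))
```

Under that assumption the 4317 ms pause from vhbl07 stays below the
threshold of a 6 second token, but longer pauses would still trigger the
warning, so the pause itself remains the thing to investigate.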
>>>
>>> OK, thanks for the tip.  Can I do this on-line, without shutting down
>>> Corosync?
>>
>> The Corosync way is to just edit/copy corosync.conf on all nodes and
>> call corosync-cfgtool -R on one of the nodes (crmsh/pcs may have a
>> better way).
>
> Great, that's what I wanted to know: whether -R is expected to make this
> change effective.
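
The edit step described above can be sketched as a small script that sets
the token value in the totem section of a corosync.conf text. This is only
an illustration: set_totem_token is a hypothetical helper, it assumes the
usual "section {" / "}" layout with one directive per line, and the
resulting file still has to be copied to all nodes before running
corosync-cfgtool -R on one of them.

```python
import re


def set_totem_token(conf_text, token_ms):
    """Return conf_text with 'token: <token_ms>' set in the totem section.

    An existing token line is replaced; if there is none, a new line is
    added just before the closing brace of the totem section.
    """
    out = []
    in_totem = False
    depth = 0      # brace depth inside the totem section
    done = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not in_totem and re.match(r"totem\s*\{", stripped):
            in_totem, depth = True, 1
            out.append(line)
            continue
        if in_totem:
            if stripped.endswith("{"):
                depth += 1                      # nested block, e.g. interface
            elif stripped == "}":
                depth -= 1
                if depth == 0:                  # end of the totem section
                    if not done:
                        out.append("\ttoken: %d" % token_ms)
                        done = True
                    in_totem = False
            elif depth == 1 and re.match(r"token\s*:", stripped):
                indent = line[: len(line) - len(line.lstrip())]
                line = "%stoken: %d" % (indent, token_ms)
                done = True
        out.append(line)
    if not done:
        raise ValueError("no totem section found")
    return "\n".join(out) + "\n"
```

Related directives such as token_retransmits_before_loss_const are left
untouched because only a line starting with "token:" is matched. After the
reload, the effective value can be checked on each node, for example with
corosync-cmapctl.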