[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck
wferi at niif.hu
Tue Aug 29 13:16:03 EDT 2017
Jan Friesse <jfriesse at redhat.com> writes:
> wferi at niif.hu writes:
>> Jan Friesse <jfriesse at redhat.com> writes:
>>> wferi at niif.hu writes:
>>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>>> ramping up):
>>>> vhbl08 corosync: [TOTEM ] A processor failed, forming new configuration.
>>>> vhbl03 corosync: [TOTEM ] A processor failed, forming new configuration.
>>>> vhbl07 corosync: [MAIN ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>>> ^^^ This is main problem you have to solve. It usually means that
>>> machine is too overloaded. It is happening quite often when corosync
>>> is running inside VM where host machine is unable to schedule regular
>>> VM running.
>> Corosync isn't running in a VM here, these nodes are 2x8 core servers
>> hosting VMs themselves as Pacemaker resources. (Incidentally, some of
>> these VMs run Corosync to form a test cluster, but that should be
>> irrelevant now.) And they aren't overloaded in any apparent way: Munin
>> reports 2900% CPU idle (out of 32 hyperthreads). There's no swap, but
>> the corosync process is locked into memory anyway. It's also running as
>> SCHED_RR prio 99, competing only with multipathd and the SCHED_FIFO prio
>> 99 kernel threads (migration/* and watchdog/*) under Linux 4.9. I'll
>> try to take a closer look at the scheduling of these. Can you recommend
>> some indicators to check out?
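One way to gather such indicators on the nodes (a sketch, assuming Linux
with procps and util-linux installed; looking the PID up via pidof is
just illustrative):

```shell
# Inspect the scheduling class and real-time priority of corosync.
# CLS is TS (SCHED_OTHER), RR (SCHED_RR) or FF (SCHED_FIFO);
# RTPRIO should read 99 for the setup described above.
PID=$(pidof corosync)
ps -o pid,cls,rtprio,pri,psr,comm -p "$PID"
chrt -p "$PID"    # prints the policy and priority in words
```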
> No real hints. But one question. Are you 100% sure memory is locked?
> Because we had problem where mlockall was called in wrong place so
> corosync was actually not locked and it was causing similar issues.
> This behavior is fixed by
I based this assertion on the L flag in the ps STAT column. The above
commit should not affect me because I'm running corosync with the -f
option:
$ ps l 3805
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
4 0 3805 1 -100 - 247464 141016 - SLsl ? 251:10 /usr/sbin/corosync -f
By the way, are the above VSZ and RSS numbers reasonable?
One more thing: these servers run without any swap.
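To double-check the lock beyond the L flag in ps STAT, the kernel
exposes the locked size directly in /proc (a sketch; the corosync PID
lookup is illustrative):

```shell
# VmLck shows how many kB of the address space are locked.
# With mlockall(MCL_CURRENT|MCL_FUTURE) in effect it should be
# close to VmRSS, not "0 kB".
PID=$(pidof corosync)
grep -E '^Vm(Lck|RSS):' /proc/"$PID"/status
```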
>>> As a start you can try what the message says: "Consider token timeout
>>> increase." Currently you have 3 seconds; in theory 6 seconds should be
>>> enough.
>> OK, thanks for the tip. Can I do this on-line, without shutting down
>> the cluster?
> Corosync way is to just edit/copy corosync.conf on all nodes and call
> corosync-cfgtool -R on one of the nodes (crmsh/pcs may have better
> tooling for this).
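For reference, the change amounts to raising the token value in the
totem section of corosync.conf on every node (a sketch; 6000 ms follows
the suggestion above, adjust to taste):

```
# /etc/corosync/corosync.conf (identical on all nodes)
totem {
        ...
        token: 6000        # milliseconds; was effectively 3000 here
}
```

then running corosync-cfgtool -R on any one node, which tells every
corosync instance in the cluster to re-read its configuration file.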
Great, that's what I wanted to know: whether -R is expected to make this
change take effect on a running cluster.