[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Klaus Wenninger kwenning at redhat.com
Wed Aug 30 03:57:31 EDT 2017


On 08/30/2017 08:54 AM, Jan Friesse wrote:
> Ferenc,
>
>> Jan Friesse <jfriesse at redhat.com> writes:
>>
>>> wferi at niif.hu writes:
>>>
>>>> Jan Friesse <jfriesse at redhat.com> writes:
>>>>
>>>>> wferi at niif.hu writes:
>>>>>
>>>>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>>>>> (in August; in May it happened only 0-2 times a day, so it's slowly
>>>>>> ramping up):
>>>>>>
>>>>>> vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
>>>>>> vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
>>>>>> vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>>>>>
>>>>> ^^^ This is the main problem you have to solve. It usually means that
>>>>> the machine is too overloaded. It happens quite often when corosync
>>>>> is running inside a VM and the host machine is unable to schedule the
>>>>> VM regularly.
>>>>
>>>> Corosync isn't running in a VM here; these nodes are 2x8-core servers
>>>> hosting VMs themselves as Pacemaker resources.  (Incidentally, some of
>>>> these VMs run Corosync to form a test cluster, but that should be
>>>> irrelevant now.)  And they aren't overloaded in any apparent way: Munin
>>>> reports 2900% CPU idle (out of 32 hyperthreads).  There's no swap, but
>>>> the corosync process is locked into memory anyway.  It's also running
>>>> as SCHED_RR prio 99, competing only with multipathd and the SCHED_FIFO
>>>> prio 99 kernel threads (migration/* and watchdog/*) under Linux 4.9.
>>>> I'll try to take a closer look at the scheduling of these.  Can you
>>>> recommend some indicators to check out?

Just saw that you are hosting VMs, which might mean you are using KSM ...
I don't fully remember the details at the moment, but I recall issues
with KSM and page-locking.
IIRC it was a bug in the kernel memory management that should have
been fixed a long time ago, but ...
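
As for the indicators you asked about: take this as a rough sketch
rather than a recipe, but something along these lines (run on an
affected node around the time of an incident, with corosync's PID
filled in; 3805 is just the one from your logs) should show whether
KSM is in play and whether corosync is seeing long run-queue delays:

  # Is KSM active on this host? (0 = off, 1 = running)
  cat /sys/kernel/mm/ksm/run

  # Real-time policy and priority actually in effect for corosync
  chrt -p 3805

  # Per-task scheduler statistics (needs CONFIG_SCHED_DEBUG); field
  # names vary by kernel, but the maximum wait time and the number of
  # involuntary context switches are the interesting ones
  cat /proc/3805/sched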

Regards,
Klaus

>>>>
>>>
>>> No real hints, but one question: are you 100% sure the memory is
>>> locked?  We had a problem where mlockall was called in the wrong place,
>>> so corosync was actually not locked, and it was causing similar issues.
>>>
>>> This behavior is fixed by
>>> https://github.com/corosync/corosync/commit/238e2e62d8b960e7c10bfa0a8281d78ec99f3a26
>>>
>>
>> I based this assertion on the L flag in the ps STAT column.  The above
>> commit should not affect me because I'm running corosync with the -f
>> option:
>
> Oh, OK. If you are running with -f, then the bug above doesn't affect you.
>
>>
>> $ ps l 3805
>> F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
>> 4     0  3805     1 -100  - 247464 141016 -     SLsl ?        251:10 /usr/sbin/corosync -f
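
Small aside on the locking question: the L in STAT only says that some
pages are locked.  A more direct check, if you want to be 100% sure,
would be the VmLck line in the process status, e.g.

  grep VmLck /proc/3805/status

0 kB there would mean the process isn't locked at all; with
mlockall(MCL_CURRENT|MCL_FUTURE) you'd expect a large value (IIRC
roughly the whole mapped address space).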
>>
>> By the way, are the above VSZ and RSS numbers reasonable?
>
> yep, perfectly reasonable.
>
> Regards,
>   Honza
>
>>
>> One more thing: these servers run without any swap.
>>
>>>>> As a start you can try what the message says: consider a token
>>>>> timeout increase.  Currently you have 3 seconds; in theory, 6 seconds
>>>>> should be enough.
>>>>
>>>> OK, thanks for the tip.  Can I do this on-line, without shutting down
>>>> Corosync?
>>>
>>> The Corosync way is to just edit/copy corosync.conf on all nodes and
>>> call corosync-cfgtool -R on one of the nodes (crmsh/pcs may have a
>>> better way).
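
For the archives, a minimal sketch of what that would look like (your
totem section will of course contain more than this; only the token
line needs to change, and 6000 is just the 6-second value Honza
mentioned, in milliseconds):

  # /etc/corosync/corosync.conf, on every node
  totem {
          token: 6000
  }

  # then, on one node, tell all corosync instances to reload:
  corosync-cfgtool -R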
>>
>> Great, that's what I wanted to know: whether -R is expected to make this
>> change effective.