[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Fri Sep 1 02:40:35 EDT 2017

Ferenc,
> Jan Friesse <jfriesse at redhat.com> writes:
>
>> wferi at niif.hu writes:
>>
>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>> ramping up):
>>>
>>> vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
>>> vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
>>> vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>>
>> ^^^ This is the main problem you have to solve. It usually means that
>> the machine is too overloaded. [...]
>
> Before I start tracing the scheduler, I'd like to ask something: what
> wakes up the Corosync main process periodically?  The token making a
> full circle?  (Please forgive my simplistic understanding of the TOTEM
> protocol.)  That would explain the recommendation in the log message,
> but does not fit well with the overload assumption: totally idle nodes
> could just as easily produce such warnings if there are no other regular
> wakeup sources.  (I'm looking at timer_function_scheduler_timeout but I
> know too little of libqb to decide.)

The Corosync main loop is based on epoll, so corosync is woken up either
by receiving data (network socket or unix socket for services), or when
there is data to send and the socket is ready for a non-blocking write,
or after a timeout. This timeout is exactly what you call the other
wakeup source.

The timeout is used for scheduling periodic tasks inside corosync.
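
To make the three wakeup reasons concrete, here is a rough sketch in
Python (corosync itself is C and uses the libqb loop, so this is only an
illustration, not the real code). The names wait_once and
wake_reasons_demo are made up for this example; selectors picks epoll on
Linux.

```python
import selectors
import socket


def wait_once(sel, timeout_s):
    """Block until a registered socket is ready or the timeout expires.

    Returns "timeout" if nothing became ready, otherwise a sorted list
    of the reasons ("readable" / "writable") that ended the wait.
    """
    events = sel.select(timeout=timeout_s)
    if not events:
        return "timeout"
    reasons = set()
    for _key, mask in events:
        if mask & selectors.EVENT_READ:
            reasons.add("readable")
        if mask & selectors.EVENT_WRITE:
            reasons.add("writable")
    return sorted(reasons)


def wake_reasons_demo():
    """Hypothetical demo: trigger each wakeup reason once, in order."""
    a, b = socket.socketpair()
    a.setblocking(False)
    b.setblocking(False)
    sel = selectors.DefaultSelector()
    results = []

    # 1) Nothing to read and nothing queued to write:
    #    only the timeout can end the wait.
    sel.register(a, selectors.EVENT_READ)
    results.append(wait_once(sel, 0.05))

    # 2) The peer sends data: the wait ends because the socket is readable.
    b.send(b"token")
    results.append(wait_once(sel, 1.0))
    a.recv(16)  # drain, so the socket is no longer readable

    # 3) We have data to send: also ask for write readiness.
    #    An idle socket is writable, so the wait ends immediately.
    sel.modify(a, selectors.EVENT_READ | selectors.EVENT_WRITE)
    results.append(wait_once(sel, 1.0))

    sel.unregister(a)
    sel.close()
    a.close()
    b.close()
    return results


if __name__ == "__main__":
    print(wake_reasons_demo())
```

The point is that on a totally idle node only case 1 fires, so the
timeout is what keeps the periodic tasks running.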

One of the periodic tasks is the scheduler pause detector. It is
basically scheduled every (token_timeout / 3) msec and it computes the
difference between the current time and the last time it ran. If the
difference is larger than (token_timeout * 0.8), it displays the
warning.
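
A minimal sketch of that logic, again in Python for illustration only
(the class and method names are invented here, and the timestamps are
fed in by hand instead of coming from a real timer):

```python
class PauseDetector:
    """Illustrative model of the scheduler pause check described above."""

    def __init__(self, token_timeout_ms):
        self.token_timeout_ms = token_timeout_ms
        self.prev_ms = None  # time of the previous run, None before the first

    @property
    def interval_ms(self):
        # The check is rescheduled every token_timeout / 3.
        return self.token_timeout_ms / 3

    @property
    def threshold_ms(self):
        # A gap larger than token_timeout * 0.8 triggers the warning.
        return self.token_timeout_ms * 0.8

    def tick(self, now_ms):
        """Run one check at time now_ms; return a warning string or None."""
        warning = None
        if self.prev_ms is not None:
            diff = now_ms - self.prev_ms
            if diff > self.threshold_ms:
                warning = (
                    "Corosync main process was not scheduled for %.4f ms "
                    "(threshold is %.4f ms). Consider token timeout increase."
                    % (diff, self.threshold_ms)
                )
        self.prev_ms = now_ms
        return warning


if __name__ == "__main__":
    det = PauseDetector(token_timeout_ms=3000)
    det.tick(0.0)                  # first run, nothing to compare with
    det.tick(1000.0)               # on schedule, no warning
    print(det.tick(5317.0054))     # delayed run, warning is returned
```

With token_timeout = 3000 the threshold comes out as 2400 ms, which is
the number shown in the vhbl07 log line quoted above.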

>
>> As a start you can try what the message says: consider a token
>> timeout increase. Currently you have 3 seconds; in theory 6 seconds
>> should be enough.
>
> It was probably high time I realized that token timeout is scaled
> automatically when one has a nodelist.  When you say Corosync should
> work OK with default settings up to 16 nodes, you assume this scaling is
> in effect, don't you?  On the other hand, I've got no nodelist in the
> config, but token = 3000, which is less than the default 1000+4*650 with
> six nodes, and this will get worse as the cluster grows.

This is described in the corosync.conf man page (token_coefficient).

The final timeout is computed using totem.token as a base value. So if
you set totem.token to 3000, it means that the final totem timeout value
is not 3000 but (3000 + 4 * 650).
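
The arithmetic as a small Python sketch. The function name is made up,
and the condition is an assumption taken from my reading of the man
page: the coefficient is applied only when a nodelist with at least 3
nodes is present, and then the result is
token + (number_of_nodes - 2) * token_coefficient. Please check the man
page of your version before relying on it.

```python
def effective_token_timeout_ms(token_ms, nodelist_nodes,
                               token_coefficient_ms=650):
    """Return the token timeout after token_coefficient scaling.

    nodelist_nodes is the number of nodes in the nodelist section
    (0 if there is no nodelist). Assumption: scaling applies only
    with at least 3 nodes in the nodelist.
    """
    if nodelist_nodes < 3:
        return token_ms
    return token_ms + (nodelist_nodes - 2) * token_coefficient_ms


if __name__ == "__main__":
    print(effective_token_timeout_ms(3000, 6))  # 3000 + 4 * 650 = 5600
    print(effective_token_timeout_ms(1000, 6))  # 1000 + 4 * 650 = 3600
    print(effective_token_timeout_ms(3000, 0))  # no nodelist: 3000
```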

Regards,
   Honza

>
> Comments on the above ramblings welcome!
>
> I'm grateful for all the valuable input poured into this thread by all
> parties: it's proven really educative in quite unexpected ways beyond
> what I was able to ask in the beginning.
>