[ClusterLabs] Corosync main process was not scheduled for 115935.2266 ms (threshold is 800.0000 ms). Consider token timeout increase.

Jan Friesse jfriesse at redhat.com
Thu Feb 25 08:48:07 UTC 2016


Adam Spiers napsal(a):
> Hi all,
>
> Jan Friesse <jfriesse at redhat.com> wrote:
>>>> There is really no help. It's best to make sure corosync is
>>>> scheduled regularly.
>>> I may sound silly, but how can I do it?
>>
>> It's actually very hard to say. Pauses like 30 seconds are really
>> unusual and shouldn't happen (especially with RT scheduling). They
>> usually happen on a VM whose host is overcommitted.
>
> It's funny you are discussing this during the same period where my
> team is seeing this happen fairly regularly within VMs on a host which
> is overcommitted.  In other words, I can confirm Jan's statement above
> is true.

Yep, sadly VMs affect scheduling a lot. For cluster nodes it really 
makes sense to pin every virtual CPU core to a dedicated physical CPU core.
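With libvirt-managed VMs, one way to do such pinning is in the domain XML. This is only a sketch: the cpuset values are hypothetical and must match your host's CPU topology (and the host cores should not be shared with other busy guests):

```xml
<!-- Fragment of a libvirt domain definition for a 2-vCPU cluster node.
     Each vCPU is pinned to its own physical core so the hypervisor
     scheduler is far less likely to starve corosync. -->
<vcpu placement='static'>2</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
</cputune>
```

The same effect can be achieved at runtime with `virsh vcpupin <domain> <vcpu> <cpuset>`, but pinning in the domain XML makes it persistent across guest restarts.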

>
> Like Konstiantyn, we have also sometimes seen no fencing occur as a
> result of these pauses, e.g.
>
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [MAIN  ] Corosync main process was not scheduled for 7343.1909 ms (threshold is 4000.0000 ms). Consider token timeout increase.
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [TOTEM ] A processor failed, forming new configuration.
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] CLM CONFIGURATION CHANGE
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] New Configuration:
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.82)
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.84)
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Left:
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Joined:
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 32: memb=2, new=0, lost=0
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: memb: d52-54-77-77-77-01 1084752466
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: memb: d52-54-77-77-77-02 1084752468
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] CLM CONFIGURATION CHANGE
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] New Configuration:
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.82)
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] #011r(0) ip(192.168.2.84)
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Left:
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CLM   ] Members Joined:
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 32: memb=2, new=0, lost=0
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: MEMB: d52-54-77-77-77-01 1084752466
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [pcmk  ] info: pcmk_peer_update: MEMB: d52-54-77-77-77-02 1084752468
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.2.82) ; members(old:2 left:0)
> Feb 24 02:53:04 d52-54-77-77-77-02 corosync[18939]:   [MAIN  ] Completed service synchronization, ready to provide service.
>
> I don't understand why it claims a processor failed, forming a new
> configuration, when the configuration appears no different from
> before: no members joined or left.  Can anyone explain this?

Corosync uses a token (very similar to the old token ring; in corosync it 
is used for access control, i.e. only the node holding the token can send 
messages, and for message ordering) together with a token timeout (the 
maximum time to wait for the token; if the token doesn't arrive, it is 
considered lost) to detect problems in the network or, more generally, 
failed nodes. If corosync is not scheduled for a long time, that is, from 
the affected node's point of view, the same situation as a lost token. 
Corosync can detect that it was not scheduled for the token timeout, but 
that really doesn't change the steps which follow. The node cannot be 
sure that the other nodes didn't form a different membership (without the 
affected node) in the meantime, so it has to go through the gather state 
(contact all nodes, get their view of the world, decide) even if, for the 
given node, nothing really changed (other nodes may see it differently).
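As the log message itself suggests, the immediate mitigation is to raise the token timeout in the totem section of corosync.conf, so that short scheduling pauses no longer exceed it. The value below is illustrative, not a recommendation; it must be tuned for the environment (a larger token timeout also means slower detection of genuinely failed nodes):

```
# /etc/corosync/corosync.conf (fragment, hypothetical value)
totem {
    version: 2
    # Maximum time in ms to wait for the token before declaring it lost.
    # The default is 1000 ms; raising it lets the cluster ride out
    # scheduling pauses on overcommitted hosts.
    token: 10000
}
```

Note that the "threshold" printed in the warning appears to scale with the token timeout (the two logs in this thread show 800 ms against the 1000 ms default and 4000 ms against a 5000 ms token, i.e. 80% of it), so raising the token timeout also raises the point at which the pause warning fires.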

Hope it helps a bit.

Regards,
   Honza

>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
