[ClusterLabs] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms), though it runs with realtime priority and there was not much load on the node

Wed Sep 4 05:21:06 EDT 2019

Hi Honza,

On Tue, Sep 3, 2019 at 7:20 PM Jan Friesse <jfriesse at redhat.com> wrote:

> Jeevan,
>
> Jeevan Patnaik napsal(a):
> >    Hi Honza,
> >
> >   Thanks for the response.
> >
> > If you increase token timeout even higher
> > (let's say 12sec) is it still appearing or not?
> > - I will try this.
> >
> >   If you try to run it without RT priority, does it help?
> > - Can RT priority affect the process scheduling negatively?
>
> Actually we've had a report that it can, because it blocks the kernel thread
> which is responsible for sending/receiving packets. I was not able to
> reproduce this behavior myself, and it seemed to be kernel specific, but
> the resolution was that behavior without RT was better.
>
Thanks, I will check this. Also, in theory, can blocking the kernel thread
responsible for sending/receiving packets affect scheduling of the corosync
process (with RT priority)?

>
> >
> > I don't see any irregular IO activity during the time when we got these
> > errors. Also, swap usage and swap IO are minimal, only in KBs.
> > We have vm.swappiness set to 1, so I don't think swap is causing any
> > issue.
> >
> > However, I see slight network activity during the issue times (What I
> > understand is network activity should not affect the CPU jobs as long as
> > CPU load is normal and without any blocking IO).
>
> It shouldn't
>
> >
> > I am thinking of debugging in the following way, unless there is an option
> > to restart corosync in debug mode:
>
> You can turn on debug messages (debug: on in logging section of
> corosync.conf).
>
Yes, I found this later. I will try debugging; hopefully it helps in
finding where the problem is.
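
For reference, a minimal sketch of the setting Honza describes. Only the logging section is shown, and any options already present in that section of corosync.conf are assumed to stay as they are:

```
logging {
    debug: on
}
```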

> >
> > -> Run strace in the background on the corosync process and redirect
> > its log to an output file.
> > -> Add a frequent cron job to rotate the output log (delete old ones),
> > unless there is a flag file to keep the old log.
> > -> Add another frequent cron job to check the corosync log for the specific
> > token timeout error and create the above-mentioned flag file so that the
> > strace output is not deleted.
> >
> > I don't know if the above process is safe to run on a production server
> > without much impact on system resources. Need to check.
> >
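
The two cron jobs in the plan above could be sketched roughly as follows. This is only an illustration under assumptions: the strace output is assumed to be written as separate chunk files ending in `.trace` in one directory, and the paths, the flag file and the function names are all hypothetical. How strace itself is started is not shown.

```python
import time
from pathlib import Path

# Warning text as it appears in the corosync log lines quoted in this thread.
PAUSE_WARNING = "Corosync main process was not scheduled"


def log_has_pause_warning(log_path):
    """Return True if any line of the corosync log contains the pause warning."""
    with Path(log_path).open(errors="replace") as log:
        return any(PAUSE_WARNING in line for line in log)


def flag_if_warning(log_path, flag_path):
    """Second cron job: create the flag file once the warning is in the log."""
    if log_has_pause_warning(log_path):
        Path(flag_path).touch()
        return True
    return False


def prune_old_traces(trace_dir, flag_path, max_age_s, now=None):
    """First cron job: delete trace chunks older than max_age_s, unless flagged.

    Returns the list of deleted paths (empty when the flag file exists).
    """
    if Path(flag_path).exists():
        return []  # a warning was seen, so keep every trace for inspection
    now = time.time() if now is None else now
    deleted = []
    for trace in sorted(Path(trace_dir).glob("*.trace")):
        if now - trace.stat().st_mtime > max_age_s:
            trace.unlink()
            deleted.append(trace)
    return deleted
```

Each function would be called from its own small cron script. Once the flag file exists, pruning stops until the flag is removed by hand. This does not answer the question of how much overhead strace adds on a production node.
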
>
> Yep. Hopefully you find something.
>
> Regards,
>    Honza
>
> >
> > On Mon, Sep 2, 2019 at 5:50 PM Jan Friesse <jfriesse at redhat.com> wrote:
> >
> >> Jeevan,
> >>
> >> Jeevan Patnaik napsal(a):
> >>> Hi,
> >>>
> >>> Also, both are physical machines.
> >>>
> >>> On Fri, Aug 30, 2019 at 7:23 PM Jeevan Patnaik <g1patnaik at gmail.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> We see the following messages almost every day in our 2-node cluster, and
> >>>> resources get migrated when it happens:
> >>>>
> >>>> [16187] node1 corosyncwarning [MAIN  ] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token timeout increase.
> >>>> [16187] node1 corosyncnotice  [TOTEM ] c.
> >>>> [16187] node1 corosyncnotice  [TOTEM ] A new membership (192.168.0.1:1268) was formed. Members joined: 2 left: 2
> >>>> [16187] node1 corosyncnotice  [TOTEM ] Failed to receive the leave message. failed: 2
> >>>>
> >>>>
> >>>> After setting the token timeout to 6000ms, at least the "Failed to
> >>>> receive the leave message" doesn't appear anymore. But we see corosync
> >>>> timeout errors:
> >>>> [16395] node1 corosyncwarning [MAIN  ] Corosync main process was not scheduled for 6660.9043 ms (threshold is 4800.0000 ms). Consider token timeout increase.
> >>>>
> >>>> 1. Why is the set timeout not in effect? It's 4800ms instead of 6000ms.
> >>
> >> It is in effect. Threshold for pause detector is set as 0.8 * token
> >> timeout.
> >>
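
To make the 0.8 factor concrete, here is a small sketch that computes the pause-detector threshold from a token timeout and pulls both numbers out of a warning line. The factor and the log format come from this thread; the function names and the regular expression are only illustrative.

```python
import re

# From the reply above: pause-detector threshold = 0.8 * token timeout.
PAUSE_FACTOR = 0.8

# Matches the numeric part of the warning as quoted in this thread.
WARNING_RE = re.compile(
    r"scheduled for (?P<pause>[\d.]+) ms \(threshold is (?P<threshold>[\d.]+) ms\)"
)


def pause_threshold_ms(token_ms):
    """Threshold (ms) that the warning reports for a given token timeout (ms)."""
    return PAUSE_FACTOR * token_ms


def implied_token_ms(threshold_ms):
    """Token timeout (ms) implied by the threshold printed in a warning."""
    return threshold_ms / PAUSE_FACTOR


def parse_pause_warning(line):
    """Return (pause_ms, threshold_ms) from a warning line, or None if no match."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return float(match["pause"]), float(match["threshold"])
```

With a token timeout of 6000 ms this gives the 4800 ms threshold seen in the second warning. By the same arithmetic, the 800 ms threshold in the first warning corresponds to a token timeout of 1000 ms.
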
> >>>> 2. How to fix this? We do not have much load on the nodes, and corosync is
> >>>> already running with RT priority.
> >>
> >> There must be something wrong. If you increase token timeout even higher
> >> (let's say 12sec) is it still appearing or not? If so, isn't the machine
> >> swapping (for example) or waiting for IO? If you try to run it without
> >> RT priority, does it help?
> >>
> >> Regards,
> >>     Honza
> >>
> >>
> >>>>
> >>>> The following are the details of the OS and packages:
> >>>>
> >>>> Kernel: 3.10.0-957.el7.x86_64
> >>>> OS: Oracle Linux Server 7.6
> >>>>
> >>>> corosync-2.4.3-4.el7.x86_64
> >>>> corosynclib-2.4.3-4.el7.x86_64
> >>>>
> >>>> Thanks in advance.
> >>>>
> >>>> --
> >>>> Regards,
> >>>> Jeevan.
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Manage your subscription:
> >>> https://lists.clusterlabs.org/mailman/listinfo/users
> >>>
> >>> ClusterLabs home: https://www.clusterlabs.org/
> >>>
> >>
> >>
> >
> > Regards,
> > Jeevan.
> >
>
>

Regards,
Jeevan