<div dir="ltr"><div dir="ltr">Hi Honza,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Sep 3, 2019 at 7:20 PM Jan Friesse <<a href="mailto:jfriesse@redhat.com">jfriesse@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Jeevan,<br>
<br>
Jeevan Patnaik wrote:<br>
> Hi Honza,<br>
> <br>
> Thanks for the response.<br>
> <br>
> If you increase token timeout even higher<br>
> (let's say 12sec) is it still appearing or not?<br>
> - I will try this.<br>
> <br>
> If you try to run it without RT priority, does it help?<br>
> - Can RT priority affect the process scheduling negatively?<br>
<br>
Actually, we've had a report that it can, because it blocks the kernel thread <br>
that is responsible for sending/receiving packets. I was not able to <br>
reproduce this behavior myself, and it seemed to be kernel specific, but the <br>
resolution was that behavior without RT was better.<br></blockquote><div>Thanks. I will check this. Also, in theory, can blocking the kernel thread responsible for sending/receiving packets affect scheduling of the corosync process (with RT priority)?</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> <br>
> I don't see any irregular IO activity during the time when we got these<br>
> errors. Also, swap usage and swap IO are minimal, only in KBs.<br>
> We have vm.swappiness set to 1, so I don't think swap is causing any issue.<br>
> <br>
> However, I see slight network activity during the issue times (my<br>
> understanding is that network activity should not affect CPU jobs as long as<br>
> CPU load is normal and there is no blocking IO).<br>
<br>
It shouldn't.<br>
<br>
> <br>
> I am thinking of debugging in the following way, unless there is an option to<br>
> restart corosync in debug mode:<br>
<br>
You can turn on debug messages (debug: on in the logging section of <br>
corosync.conf).<br>
<br></blockquote><div>Yes, I found this later. I will try debugging, hoping it helps pinpoint where the problem is.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> <br>
> -> Run strace in the background on the corosync process and redirect its<br>
> output to a log file<br>
> -> Add a frequent cron job to rotate the strace log (deleting old ones),<br>
> unless a flag file is present to keep the old log<br>
> -> Add another frequent cron job to check the corosync log for the specific<br>
> token timeout error and create the above-mentioned flag file so that the<br>
> strace output is not deleted.<br>
> <br>
> I don't know if the above process is safe to run on a production server without much impact on system resources. I need to check.<br>
> <br>
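<br>
The log-checking step described above could be sketched roughly as follows. This is only an illustration, not a tested production script: the function name and the sample lines are made up here, and the pattern is based on the warning text quoted further down in this thread.<br>
<pre>
```python
import re

# Matches the pause-detector warning text quoted in this thread.
PAUSE_RE = re.compile(
    r"not scheduled for ([0-9.]+) ms \(threshold is ([0-9.]+) ms\)"
)


def find_pauses(lines):
    """Return (pause_ms, threshold_ms) for each pause-detector warning."""
    hits = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            hits.append((float(match.group(1)), float(match.group(2))))
    return hits


# Hypothetical sample input; a cron job would read the real corosync log.
sample = [
    "Corosync main process was not scheduled for 6660.9043 ms "
    "(threshold is 4800.0000 ms). Consider token timeout increase.",
    "A new membership was formed.",
]

# A non-empty result is the signal to create the flag file that keeps
# the strace output from being rotated away.
print(find_pauses(sample))
```
</pre>
A cron job built on this would only need to read the log file line by line and create the flag file when the returned list is not empty.<br>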
<br>
Yep. Hopefully you find something.<br>
<br>
Regards,<br>
Honza<br>
<br>
> <br>
> On Mon, Sep 2, 2019 at 5:50 PM Jan Friesse <<a href="mailto:jfriesse@redhat.com" target="_blank">jfriesse@redhat.com</a>> wrote:<br>
> <br>
>> Jeevan,<br>
>><br>
>> Jeevan Patnaik wrote:<br>
>>> Hi,<br>
>>><br>
>>> Also, both are physical machines.<br>
>>><br>
>>> On Fri, Aug 30, 2019 at 7:23 PM Jeevan Patnaik <<a href="mailto:g1patnaik@gmail.com" target="_blank">g1patnaik@gmail.com</a>> wrote:<br>
>>><br>
>>>> Hi,<br>
>>>><br>
>>>> We see the following messages almost every day in our 2-node cluster, and<br>
>>>> resources get migrated when it happens:<br>
>>>><br>
>>>> [16187] node1 corosync warning [MAIN ] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token timeout increase.<br>
>>>> [16187] node1 corosync notice [TOTEM ] c.<br>
>>>> [16187] node1 corosync notice [TOTEM ] A new membership (<a href="http://192.168.0.1:1268" rel="noreferrer" target="_blank">192.168.0.1:1268</a>) was formed. Members joined: 2 left: 2<br>
>>>> [16187] node1 corosync notice [TOTEM ] Failed to receive the leave message. failed: 2<br>
>>>><br>
>>>><br>
>>>> After setting the token timeout to 6000ms, at least the "Failed to<br>
>>>> receive the leave message" doesn't appear anymore. But we see corosync<br>
>>>> timeout errors:<br>
>>>> [16395] node1 corosync warning [MAIN ] Corosync main process was not<br>
>>>> scheduled for 6660.9043 ms (threshold is 4800.0000 ms). Consider token<br>
>>>> timeout increase.<br>
>>>><br>
>>>> 1. Why is the set timeout not in effect? It's 4800ms instead of 6000ms.<br>
>><br>
>> It is in effect. Threshold for pause detector is set as 0.8 * token<br>
>> timeout.<br>
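<br>
The 0.8 factor accounts for both thresholds in the quoted logs. A minimal sketch of the arithmetic follows; the function name is made up for illustration, and the 1000 ms token value is only inferred from the 800 ms threshold in the first log, not stated anywhere in this thread:<br>
<pre>
```python
def pause_threshold_ms(token_timeout_ms, factor=0.8):
    """Pause-detector threshold as a fraction of the token timeout."""
    return factor * token_timeout_ms


# 1000 ms is inferred from the first log (800 / 0.8); 6000 ms and
# 12000 ms are the values discussed in this thread.
for token in (1000, 6000, 12000):
    print(token, round(pause_threshold_ms(token), 4))
```
</pre>
With a 6000 ms token timeout this gives 4800 ms, matching the warning, so the configured timeout is being applied.<br>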
>><br>
>>>> 2. How to fix this? There is not much load on the nodes, and corosync is<br>
>>>> already running with RT priority.<br>
>><br>
>> There must be something wrong. If you increase token timeout even higher<br>
>> (let's say 12sec) is it still appearing or not? If so, isn't the machine<br>
>> swapping (for example) or waiting for IO? If you try to run it without<br>
>> RT priority, does it help?<br>
>><br>
>> Regards,<br>
>> Honza<br>
>><br>
>><br>
>>>><br>
>>>> The following are the details of the OS and packages:<br>
>>>><br>
>>>> Kernel: 3.10.0-957.el7.x86_64<br>
>>>> OS: Oracle Linux Server 7.6<br>
>>>><br>
>>>> corosync-2.4.3-4.el7.x86_64<br>
>>>> corosynclib-2.4.3-4.el7.x86_64<br>
>>>><br>
>>>> Thanks in advance.<br>
>>>><br>
>>>> --<br>
>>>> Regards,<br>
>>>> Jeevan.<br>
>>><br>
>>>><br>
>>><br>
>>><br>
>>><br>
>>><br>
>>> _______________________________________________<br>
>>> Manage your subscription:<br>
>>> <a href="https://lists.clusterlabs.org/mailman/listinfo/users" rel="noreferrer" target="_blank">https://lists.clusterlabs.org/mailman/listinfo/users</a><br>
>>><br>
>>> ClusterLabs home: <a href="https://www.clusterlabs.org/" rel="noreferrer" target="_blank">https://www.clusterlabs.org/</a><br>
>>><br>
>><br>
>><br>
> <br>
> Regards,<br>
> Jeevan.<br>
> <br>
<br>
</blockquote></div><br clear="all"><div><br></div>Regards,<br>Jeevan<div dir="ltr" class="gmail_signature"><div href="http://WS_promo" style="width:auto;padding-top:2px;font-size:10px;border-top:1px solid rgb(238,238,238);margin-top:10px;display:table;direction:ltr;line-height:normal;border-spacing:initial">
</div>
</div></div>