[ClusterLabs] Corosync main process was not scheduled for 115935.2266 ms (threshold is 800.0000 ms). Consider token timeout increase.

Kostiantyn Ponomarenko konstantin.ponomarenko at gmail.com
Wed Feb 24 09:34:58 UTC 2016


Jan, thank you for the help.
I assume that my understanding of what was happening is correct.

Thank you,
Kostia

On Fri, Feb 19, 2016 at 12:11 PM, Jan Friesse <jfriesse at redhat.com> wrote:

>
>> Jan, thank you for the answer. I still have a few questions.
>>
>> Also, FYI, this happened on a cluster with just one node.
>> The cluster is designed to be a two-node cluster, with the possibility of
>> working even with only one node.
>>
>>> and if corosync was not scheduled for more than the token timeout, the
>>> "Process pause detected for ..." message is displayed and a new membership
>>> is formed.
>> That means the other nodes will call STONITH to fence that node, right?
>> Although it didn't happen that way here, because there was no other
>> "online" node in the cluster.
>> And because I didn't configure my fencing device to accept a node list, it
>> was called by the cluster anyway.
>> And the call was successful because of the logic: the other node was not
>> connected to the physical STONITH device, which equals a "successful
>> STONITH".
>> And then Pacemaker's logic kicked in and it shut itself down (the "crit:"
>> message in the log).
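>>
>> (Illustration only, with made-up agent and parameter values: limiting
>> which nodes a fencing device is allowed to fence is typically done with
>> the pcmk_host_list parameter of the stonith resource, e.g. in crm shell
>> syntax:
>>
>>   # example values only; agent, address and credentials are hypothetical
>>   primitive fence-node1 stonith:fence_ipmilan \
>>       params pcmk_host_list="node-1" ipaddr="10.0.0.1" \
>>       login="admin" passwd="secret"
>>
>> Only the pcmk_host_list part matters for the point above.)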
>>
>>> There is really no help. It's best to make sure corosync is scheduled
>>> regularly.
>> I may sound silly, but how can I do it?
>>
>
> It's actually very hard to say. Pauses like 30 sec are really unusual and
> shouldn't happen (especially with RT scheduling). It usually happens on a
> VM whose host is overcommitted, and I don't think that's your case. Another
> problem may be thrashing (which in theory may be the case here, because of
> the "Out of memory: Kill process 8674 ..." message).
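>
> Just as an illustration (assuming a fairly standard Linux setup), you can
> check whether corosync really runs with realtime priority, and whether the
> machine was swapping or OOM-killing around the time of the pauses:
>
>   # scheduling policy/priority of corosync (SCHED_RR expected with -r)
>   chrt -p $(pidof corosync)
>
>   # any OOM killer activity in the kernel log
>   dmesg | grep -iE "out of memory|oom-killer"
>
>   # watch the si/so columns; sustained non-zero values point to thrashing
>   vmstat 5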
>
>
>
>
>> I did:
>> # grep -n --color "Process pause detected for\|Corosync main process was
>> not scheduled for\|invoked oom-killer\|kernel: Out of memory: Kill
>> process\|kernel: Killed process\|RA:\|error:\|fence\|STONITH"
>> B5-2U-205-LS.log >B5-2U-205-LS.log.cluster_excerpt
>>
>> and attached the result to this email.
>>
>>
>> I see a complete disaster in the syslog.
>> Correct me if I am wrong, but here I will try to analyze what happened to
>> the cluster:
>>
>> At some point, when the system had already been working under really high
>> load for about 16 hours, Corosync started to report "Corosync main process
>> was not scheduled for ...".
>> This means that Corosync wasn't scheduled by the OS often enough, so it
>> couldn't detect membership changes (token timeout).
>> Then, after a few such messages appeared almost in a row, the monitor
>> operations of a few resources failed.
>> Question: in the log, the entry from the resource agent is shown first, and
>> then "lrmd" reports a timeout problem, like this:
>>
>> Jan 29 07:00:19 B5-2U-205-LS diskHelper(sm0dh)[18835]: WARNING: RA:
>> [monitor] : got rc=1
>> Jan 29 07:00:32 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
>> sm0dh_monitor_30000:18803:stderr [ Failed to get properties: Connection
>> timed out ]
>>
>> Does this mean that the monitor failed because of a timeout?
>>
>> Two minutes later Corosync started to report another message, "Process
>> pause detected for ...", which means that Corosync was not scheduled for
>> more than the token timeout.
>> Then, five minutes later, I have this line in the log:
>>
>> Jan 29 07:05:54 B5-2U-205-LS kernel: stonithd invoked oom-killer:
>> gfp_mask=0x201da, order=0, oom_score_adj=0
>>
>> Which I assume means "stonithd tried to allocate some memory, and the
>> kernel decided to run the oom-killer because there was not enough available
>> memory". Am I right here?
>> Why was stonithd active at that time?
>> Was it because of Corosync's "Process pause detected for ..." message?
>> What was stonithd actually trying to do?
>>
>> Then the oom-killer kills one of the heavy processes.
>> Then systemd-journal requests memory (?) and another heavy resource goes
>> down, killed by the oom-killer.
>>
>> Both killed resources were under Pacemaker's control.
>> Other processes managed by Pacemaker report monitor timeouts (?).
>> Then one of them times out on its "stop" operation, so Pacemaker requests
>> that the node be STONITHed.
>> There is only one node in the cluster, and the only running resource is not
>> designed properly (it doesn't have a "node-list") to kill the node on which
>> it runs.
>> And because there is no other node physically connected to the fencing
>> device, STONITH reports success.
>> Pacemaker's internal check kicks in (thank you, guys!) and Pacemaker shuts
>> itself down.
>>
>>
>> Please correct me if I am wrong in this log analysis. I just want to
>> improve my understanding of what is happening here.
>> As a general question, did this all happen because of the following:
>>
>> For some reason, Corosync started to experience a lack of processor time
>> (scheduling).
>> That is why monitor operations started to time out.
>> Then, after the "Process pause detected for ..." message, I assume the node
>> should be STONITHed by the other node, but there is no other node, so what
>> should happen in that case?
>> Then, for some reason, "stonithd" triggered the "oom-killer", which killed
>> one of the managed resources. Why?
>> Monitor operations time out for all resources continuously.
>> Eventually the "stop" operation times out for one of the resources, which
>> is why Pacemaker eventually shuts down.
>>
>> Please correct me in case I am wrong anywhere in my assumptions.
>>
>> Thank you for spending your precious time reading all this =)
>> Hope for some help here =)
>>
>>
>> Thank you,
>> Kostia
>>
>> On Wed, Feb 17, 2016 at 6:47 PM, Jan Friesse <jfriesse at redhat.com> wrote:
>>
>>> Kostiantyn Ponomarenko wrote:
>>>
>>>> Thank you for the suggestion.
>>>> The OS is Debian 8. All packages are built by myself.
>>>> libqb-0.17.2
>>>> corosync-2.3.5
>>>> cluster-glue-1.0.12
>>>> pacemaker-1.1.13
>>>>
>>>> It is really important for me to understand what is happening with the
>>>> cluster under high load.
>>>>
>>>>
>>> For Corosync it's really simple. Corosync has to be scheduled by the OS
>>> regularly (more often than its current token timeout) to be able to detect
>>> membership changes and send/receive messages (cpg). If it's not scheduled,
>>> membership is not up to date, and eventually, when it's finally scheduled,
>>> it logs the "process was not scheduled for ... ms" message (a warning for
>>> the user), and if corosync was not scheduled for more than the token
>>> timeout, the "Process pause detected for ..." message is displayed and a
>>> new membership is formed. Other nodes (if scheduled regularly) see the
>>> non-regularly-scheduled node as dead.
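>>>
>>> As a rough sketch (example value only, not a recommendation): the token
>>> timeout is set in the totem section of corosync.conf, and all nodes must
>>> use the same value. The 800 ms threshold in your log suggests the default
>>> token of 1000 ms was in use.
>>>
>>>   totem {
>>>       version: 2
>>>       # token timeout in milliseconds (example value)
>>>       token: 5000
>>>       ...
>>>   }
>>>
>>> Corosync has to re-read the configuration (e.g. a restart) for the change
>>> to take effect.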
>>>
>>>> So I would appreciate any help here =)
>>>
>>>>
>>>>
>>> There is really no help. It's best to make sure corosync is scheduled
>>> regularly.
>>>
>>>
>>>
>>>> Thank you,
>>>> Kostia
>>>>
>>>> On Wed, Feb 17, 2016 at 5:02 PM, Greg Woods <woods at ucar.edu> wrote:
>>>>
>>>>
>>>> On Wed, Feb 17, 2016 at 3:30 AM, Kostiantyn Ponomarenko <
>>>>> konstantin.ponomarenko at gmail.com> wrote:
>>>>>
>>>>>> Jan 29 07:00:43 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main
>>>>>> process was not scheduled for 12483.7363 ms (threshold is 800.0000 ms).
>>>>>> Consider token timeout increase.
>>>>>>
>>>>>>
>>>>>
>>>>> I was having this problem as well. You don't say which version of
>>>>> corosync you are running or on what OS, but on CentOS 7, there is an
>>>>> available update that looks like it might address this (it has to do
>>>>> with scheduling).
>>>
>>> This update sets round robin realtime scheduling for corosync by default.
>>> Same can be achieved without update by editing /etc/sysconfig/corosync and
>>> changing COROSYNC_OPTIONS line to something like COROSYNC_OPTIONS="-r"
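>>>
>>> (Illustration only: after restarting corosync, something like
>>>
>>>   # should show scheduling class RR and a non-zero rtprio for corosync
>>>   ps -eo pid,cls,rtprio,comm | grep corosync
>>>
>>> can be used to verify that the realtime policy took effect.)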
>>>
>>> Regards,
>>>    Honza
>>>
>>>
>>>>> We haven't gotten around to actually applying it yet because it will
>>>>> require some down time on production services (we do have a few
>>>>> node-locked VMs in our cluster), and it only happens when the system is
>>>>> under very high load, so I can't say for sure the update will fix the
>>>>> issue, but it might be worth looking into.
>>>>>
>>>>> --Greg
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>