[ClusterLabs] Corosync main process was not scheduled for 115935.2266 ms (threshold is 800.0000 ms). Consider token timeout increase.

Kostiantyn Ponomarenko konstantin.ponomarenko at gmail.com
Thu Feb 18 10:11:31 EST 2016


Jan, thank you for the answer. I still have a few questions.

Also, FYI, this happened on a cluster with just one node.
The cluster is designed to be a two-node cluster, with the possibility of
working with only one node.

>> and if corosync was not scheduled for more than token timeout "Process
>> pause detected for ..." message is displayed and new membership is formed.

That means the other nodes will call STONITH to fence that node, right?
Although it didn't happen that way here, because there was no other
"online" node in the cluster.
And because I didn't configure my fencing device to accept a node list, it
was still called by the cluster.
And the call was successful because of the device's logic: the other node
was not connected to the physical STONITH device, which amounts to a
"successful STONITH".
And then Pacemaker's internal logic kicked in and it shut itself down (the
"crit:" message in the log).
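
(By "node list" I mean what I believe is the pcmk_host_list parameter of
the fencing resource. A crmsh sketch of what I think a proper setup would
look like; the agent choice, node names and credentials are all made up:)

primitive fence-other stonith:external/ipmi \
    params pcmk_host_list="other-node" hostname=other-node \
           ipaddr=192.168.1.10 userid=admin passwd=secret \
           interface=lan
# pcmk_host_list tells Pacemaker which nodes this device is able to
# fence; without it the cluster has to guess the device-to-node mapping.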

>> There is really no help. It's best to make sure corosync is scheduled
>> regularly.

This may sound silly, but how can I do that?
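
For example, would checking and raising corosync's scheduling class be
the right direction? A sketch (assuming util-linux's chrt and an already
running corosync; the priority value is only an illustration):

# show corosync's current scheduling policy and priority
chrt -p $(pidof corosync)
# give the running daemon round-robin realtime scheduling
chrt -r -p 99 $(pidof corosync)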

I did:

# grep -n --color "Process pause detected for\|\
Corosync main process was not scheduled for\|\
invoked oom-killer\|\
kernel: Out of memory: Kill process\|\
kernel: Killed process\|\
RA:\|error:\|fence\|STONITH" \
    B5-2U-205-LS.log > B5-2U-205-LS.log.cluster_excerpt

and attached the result to this message.


I see a complete disaster in the syslog.
Correct me if I am wrong, but here is my attempt to analyze what happened
to the cluster:

At some point, when the system had already been working under really high
load for about 16 hours, Corosync started to report "Corosync main process
was not scheduled for ...".
This means that Corosync wasn't scheduled by the OS often enough, so it
couldn't detect membership changes within the token timeout.
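
As I read the "Consider token timeout increase" advice in that message,
the knob to turn would be the token value in the totem section of
corosync.conf. A sketch with a purely illustrative number (the default is
1000 ms, which would also explain the 800 ms threshold in the warning, if
I read it right):

totem {
    version: 2
    # token timeout in milliseconds; 10000 is only an illustration
    token: 10000
}
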
Then, after a few such messages appearing almost in a row, the monitor
operations of a few resources failed.
Question: in the log, the entry from the resource agent appears first,
then "lrmd" reports a timeout problem, like this:

Jan 29 07:00:19 B5-2U-205-LS diskHelper(sm0dh)[18835]: WARNING: RA:
[monitor] : got rc=1
Jan 29 07:00:32 B5-2U-205-LS lrmd[3012]: notice: operation_finished:
sm0dh_monitor_30000:18803:stderr [ Failed to get properties: Connection
timed out ]

Does this mean that the monitor failed because of a timeout?
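
For reference, the 30000 in "sm0dh_monitor_30000" should be the monitor
interval in milliseconds. In crmsh the operations would be declared
something like this (the provider name and the timeout values are my
guesses):

primitive sm0dh ocf:custom:diskHelper \
    op monitor interval=30s timeout=60s \
    op stop timeout=60s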

Two minutes later, Corosync started to report a different message,
"Process pause detected for ...".
This means that Corosync was not scheduled for more than a token timeout.
Then, five minutes later, I have this line in the log:

Jan 29 07:05:54 B5-2U-205-LS kernel: stonithd invoked oom-killer:
gfp_mask=0x201da, order=0, oom_score_adj=0

I assume this means "stonithd tried to allocate some memory, and the
kernel decided to run the oom-killer because there was not enough
available memory". Am I right here?
Why was stonithd active at that moment?
Was it because of Corosync's "Process pause detected for ..." message?
What was stonithd actually trying to do?
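
As a side question: would it make sense to shield the cluster daemons
from the oom-killer? A sketch of what I have in mind (writing -1000 to
oom_score_adj exempts a process from OOM killing entirely; assumes
corosync and pacemakerd are running):

echo -1000 > /proc/$(pidof corosync)/oom_score_adj
echo -1000 > /proc/$(pidof pacemakerd)/oom_score_adj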

Then the oom-killer kills one of the heavy processes.
Then systemd-journal requests memory (?) and another heavy resource goes
down, killed by the oom-killer.

Both killed resources were under Pacemaker's control.
Other processes managed by Pacemaker report monitor timeouts (?).
Then one of them times out on its "stop" operation, so Pacemaker requests
that the node be STONITHed.
There is only one node in the cluster, and the only running fencing
resource is not set up properly (it has no node list) to kill the node on
which it runs.
And because there is no other node physically connected to the fencing
device, STONITH reports success.
Pacemaker's internal check kicks in (thank you, guys!) and Pacemaker shuts
itself down.


Please correct me if I am wrong in this log analysis. I just want to
level up my understanding of what is happening here.
As a general question, did this all happen because of the following
sequence?

1. For some reason, Corosync started to experience a lack of processor
   time (scheduling).
2. That is why monitor operations started to time out.
3. After the "Process pause detected for ..." message, I assume the node
   should be STONITHed by the other node; but there is no other node, so
   what should happen in that case?
4. Then, for some reason, "stonithd" triggered the "oom-killer", which
   killed one of the managed resources. Why?
5. Monitor operations time out for all resources continuously.
6. Eventually the "stop" operation times out for one of the resources,
   which is why Pacemaker eventually shuts itself down.

Please correct me in case I am wrong anywhere in my assumptions.

Thank you for spending your precious time reading all this =)
Hoping for some help here =)


Thank you,
Kostia

On Wed, Feb 17, 2016 at 6:47 PM, Jan Friesse <jfriesse at redhat.com> wrote:

> Kostiantyn Ponomarenko wrote:
>
>> Thank you for the suggestion.
>> The OS is Debian 8. All packages are built by me.
>> libqb-0.17.2
>> corosync-2.3.5
>> cluster-glue-1.0.12
>> pacemaker-1.1.13
>>
>> It is really important for me to understand what is happening with the
>> cluster under high load.
>>
>
> For Corosync it's really simple. Corosync has to be scheduled by the OS
> regularly (more often than its current token timeout) to be able to detect
> membership changes and send/receive messages (cpg). If it's not scheduled,
> membership is not up to date, and eventually, when it's finally scheduled, it
> logs the "process was not scheduled for ... ms" message (a warning for the
> user); and if corosync was not scheduled for more than the token timeout, the
> "Process pause detected for ..." message is displayed and a new membership is
> formed. Other nodes (if scheduled regularly) see the irregularly scheduled
> node as dead.
>
> So I would appreciate any help here =)
>>
>
> There is really no help. It's best to make sure corosync is scheduled
> regularly.
>
>
>>
>> Thank you,
>> Kostia
>>
>> On Wed, Feb 17, 2016 at 5:02 PM, Greg Woods <woods at ucar.edu> wrote:
>>
>>
>>> On Wed, Feb 17, 2016 at 3:30 AM, Kostiantyn Ponomarenko <
>>> konstantin.ponomarenko at gmail.com> wrote:
>>>
>>> Jan 29 07:00:43 B5-2U-205-LS corosync[2742]: [MAIN  ] Corosync main
>>>> process was not scheduled for 12483.7363 ms (threshold is 800.0000 ms).
>>>> Consider token timeout increase.
>>>>
>>>
>>>
>>> I was having this problem as well. You don't say which version of
>>> corosync
>>> you are running or on what OS, but on CentOS 7, there is an available
>>>
>>
> This update sets round-robin realtime scheduling for corosync by default.
> The same can be achieved without the update by editing /etc/sysconfig/corosync
> and changing the COROSYNC_OPTIONS line to something like COROSYNC_OPTIONS="-r"
>
> Regards,
>   Honza
>
>
> update that looks like it might address this (it has to do with
>>> scheduling). We haven't gotten around to actually applying it yet because
>>> it will require some down time on production services (we do have a few
>>> node-locked VMs in our cluster), and it only happens when the system is
>>> under very high load, so I can't say for sure the update will fix the
>>> issue, but it might be worth looking into.
>>>
>>> --Greg