[ClusterLabs] Antw: Re: Antw: pacemaker reports monitor timeout while CPU is high
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Jan 11 03:18:53 EST 2018
Hi!
A few years ago I was playing with cgroups and got quite interesting (useful)
results, but applying the cgroups to existing and newly started processes was
quite hard to integrate into the OS, so I did not pursue it further. I think
cgroups are even more powerful today, but I haven't followed how easy they are
to use on systems based on systemd (which uses cgroups heavily AFAIK).
In short: You may be unable to control the client processes, but you could
control the server processes the clients start.
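As a rough illustration of that idea (assuming cgroup v1 with the libcgroup
tools installed; the group name and values are made up for the example), the
PostgreSQL server processes could be confined like this:

    cgcreate -g cpu:/pgsql                      # dedicated cpu group for the database
    cgset -r cpu.shares=512 pgsql               # halve its weight relative to the default 1024
    cgclassify -g cpu:pgsql $(pidof postgres)   # move the running postmaster and backends into it

Backends forked later inherit the postmaster's group. On systemd-based systems,
something like "systemd-run --scope -p CPUQuota=80% <command>" should give a
similar effect for processes you start yourself.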
Regards,
Ulrich
>>> ??? <fanguoteng at highgo.com> wrote on 11.01.2018 at 05:01 in message
<492a1ace20c04e85bc4979307af2a0be at EX01.highgo.com>:
> Ulrich,
>
> Thank you very much for the help. When we run the performance test, our
> application (pgsql-ha) starts more than 500 processes to handle the client
> requests. Could that be causing this issue?
>
> Is there any workaround or method to keep Pacemaker from restarting the resource
> in such a situation? Right now the system cannot work when the clients send a
> high call load, and we cannot control the clients' behavior.
>
> Thanks
>
>
> -----Original Message-----
> From: Ulrich Windl [mailto:Ulrich.Windl at rz.uni-regensburg.de]
> Sent: 10 January 2018 18:20
> To: users at clusterlabs.org
> Subject: [ClusterLabs] Antw: pacemaker reports monitor timeout while CPU is high
>
> Hi!
>
> I can only speak for myself: in former times with HP-UX, we had severe
> performance problems when the load was in the range of 8 to 14 (I/O waits not
> included, average over all logical CPUs), while on Linux we only get into
> problems with a load above 40 or so (I/O included, summed over all logical
> CPUs, of which there are 24). Also, I/O waits cause cluster timeouts before
> CPU load actually matters (for us).
> So with a load above 400 (not knowing your number of CPUs), such problems should
> not be that unusual. What is the number of threads in your system at that time?
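> (Something like the following one-liner could give that number; just an
> illustration:)
>
>     ps -eo nlwp= | awk '{ n += $1 } END { print n }'   # total threads across all processes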
> It might be worth the effort to bind the cluster processes to specific CPUs
> and keep other tasks away from those, but I don't have experience with that.
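> A minimal sketch of what that pinning could look like (the CPU numbers are
> only illustrative assumptions):
>
>     taskset -pc 0,1 $(pidof corosync)      # pin corosync to CPUs 0-1
>     taskset -pc 0,1 $(pidof pacemakerd)    # pin the pacemaker parent process
>     # children that are already running keep their old affinity; newly forked
>     # children inherit it, so this is best done right after the cluster starts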
> I guess the "High CPU load detected" message triggers some internal suspend
> in the cluster engine (assuming the cluster engine caused the high load). Of
> course, for "external" load that measure won't help...
>
> Regards,
> Ulrich
>
>
>>>> ??? <fanguoteng at highgo.com> wrote on 10.01.2018 at 10:40 in message
> <4dc98a5d9be144a78fb9a187217439ed at EX01.highgo.com>:
>> Hello,
>>
>> This issue only appears when we run a performance test and the CPU load is high.
>> The cluster configuration and log are below. Pacemaker restarts the slave-side
>> pgsql-ha resource about every two minutes.
>>
>> Take the following scenario as an example. (When the pgsqlms RA is called, we
>> print the log “execute the command start (command)”; when the command returns,
>> we print the log “execute the command stop (command) (result)”.)
>>
>> 1. We can see that Pacemaker calls “pgsqlms monitor” about every 15 seconds,
>> and it returns $OCF_SUCCESS.
>>
>> 2. It calls the monitor command again at 13:56:16 and then reports a timeout
>> error at 13:56:18. That is only 2 seconds, but it reports “timeout=10000ms”.
>>
>> 3. In other logs, sometimes after 15 minutes there is no “execute the command
>> start monitor” printed and it reports a timeout error directly.
>>
>> Could you please tell us how to debug or resolve this issue?
>>
>> The log:
>>
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
>> Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.779999
>> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms
>> Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
>> Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
>> Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
>> Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)
>> Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
>>
>>
>> The Cluster Configuration:
>> 2 nodes and 13 resources configured
>>
>> Online: [ db1 db2 ]
>>
>> Full list of resources:
>>
>> Clone Set: dlm-clone [dlm]
>> Started: [ db1 db2 ]
>> Clone Set: clvmd-clone [clvmd]
>> Started: [ db1 db2 ]
>> ipmi_node1 (stonith:fence_ipmilan): Started db2
>> ipmi_node2 (stonith:fence_ipmilan): Started db1
>> Clone Set: clusterfs-clone [clusterfs]
>> Started: [ db1 db2 ]
>> Master/Slave Set: pgsql-ha [pgsqld]
>> Masters: [ db1 ]
>> Slaves: [ db2 ]
>> Resource Group: mastergroup
>> db1-vip (ocf::heartbeat:IPaddr2): Started
>> rep-vip (ocf::heartbeat:IPaddr2): Started
>> Resource Group: slavegroup
>> db2-vip (ocf::heartbeat:IPaddr2): Started
>>
>>
>> pcs resource show pgsql-ha
>> Master: pgsql-ha
>> Meta Attrs: interleave=true notify=true
>> Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>> Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
>> Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
>> stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>> promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
>> demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>> monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
>> monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
>> notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
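>> For reference, both monitor operations above run with timeout=10s. As a purely
>> illustrative workaround while the load problem is investigated (the exact pcs
>> syntax can differ between versions, and 60s is an arbitrary example value),
>> the timeouts could be raised with something like:
>>
>>     pcs resource update pgsqld op monitor interval=15s role=Master timeout=60s \
>>         op monitor interval=16s role=Slave timeout=60s
>>     pcs resource show pgsql-ha   # verify that all operations are still defined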
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org