[ClusterLabs] Re: Antw: pacemaker reports monitor timeout while CPU is high

Wed Jan 10 23:01:06 EST 2018

Ulrich,

Thank you very much for the help. When we run the performance test, our application (pgsql-ha) starts more than 500 processes to handle client requests. Could that cause this issue?

Is there any workaround or method to keep Pacemaker from restarting the resource in such a situation? At the moment the system cannot work if the client sends a high call load, and we cannot control the client's behavior.

Thanks

-----Original Message-----
From: Ulrich Windl [mailto:Ulrich.Windl at rz.uni-regensburg.de]
Sent: January 10, 2018 18:20
To: users at clusterlabs.org
Subject: [ClusterLabs] Antw: pacemaker reports monitor timeout while CPU is high

Hi!

I can only speak for myself: in the past, on HP-UX, we had severe performance problems when the load was in the range of 8 to 14 (I/O waits not included, averaged over all logical CPUs), while on Linux we start having problems at a load above 40 or so (I/O included, summed over all 24 logical CPUs). Also, I/O waits cause cluster timeouts before CPU load actually matters (for us).
So with a load above 400 (not knowing your number of CPUs), timeouts should not be that unusual. How many threads are running on your system at that time?
It might be worth the effort to bind the cluster processes to specific CPUs and keep other tasks away from those, but I have no experience with that.
I guess the "High CPU load detected" message triggers some internal suspend in the cluster engine (assuming the cluster engine itself caused the high load). Of course, for "external" load that measure won't help...
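
To compare load figures between machines with different CPU counts, it can help to normalize the load by the number of logical CPUs. The following is only a minimal sketch: the helper name `load_per_cpu` is made up for illustration, and the CPU count of the affected nodes is not known from this thread, so only the 40-over-24 figure above is worked through.

```python
import os


def load_per_cpu(load_average: float, logical_cpus: int) -> float:
    """Return the load average divided by the number of logical CPUs."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return load_average / logical_cpus


# Figure mentioned above: a load of about 40 summed over 24 logical CPUs.
print(round(load_per_cpu(40, 24), 2))  # 1.67

# Same calculation for the host this script runs on (1-minute load average).
one_minute, _, _ = os.getloadavg()
cpus = os.cpu_count() or 1
print(f"{one_minute:.2f} over {cpus} CPUs = "
      f"{load_per_cpu(one_minute, cpus):.2f} per CPU")
```

Running this on a cluster node during the performance test would put the reported load into the same per-CPU terms.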

Regards,
Ulrich

>>> ??? <fanguoteng at highgo.com> wrote on 10.01.2018 at 10:40 in message
<4dc98a5d9be144a78fb9a187217439ed at EX01.highgo.com>:
> Hello,
> 
> This issue only appears when we run a performance test and the CPU load
> is high. The cluster configuration and log are shown below. Pacemaker
> restarts the Slave side pgsql-ha resource about every two minutes.
> 
> Take the following scenario as an example. (When the pgsqlms RA is
> called, we print the log “execute the command start (command)”. When
> the command returns, we print the log “execute the command stop
> (command) (result)”.)
> 
> 1.     We can see that Pacemaker calls “pgsqlms monitor” about every 15
> seconds, and it returns $OCF_SUCCESS.
> 
> 2.     It calls the monitor command again at 13:56:16, and then reports a
> timeout error at 13:56:18. That is only 2 seconds later, but it reports
> “timeout=10000ms”.
> 
> 3.     In other logs, sometimes after 15 minutes, no “execute the command
> start monitor” line is printed and the timeout error is reported directly.
> 
> Could you please tell us how to debug or resolve this issue?
> 
> The log:
> 
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
> Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU load detected: 426.779999
> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms
> Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
> Jan 10 13:56:18 sds2 crmd[26096]:  notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
> Jan 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
> Jan 10 13:56:19 sds2 pengine[26095]:  notice: Recover pgsqld:0#011(Slave db2)
> Jan 10 13:56:19 sds2 pengine[26095]:  notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
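
The gap between the agent's "start monitor" line and the lrmd timeout can be checked mechanically. The sketch below is only an illustration: the regular expressions are written against the two log lines copied from this thread, the function names are made up, and the year 2018 is assumed because syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Two lines copied from the log in this thread (PID 5606).
LOG = """\
Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms
"""

START = re.compile(
    r"pgsqlms\(pgsqld\)\[(\d+)\]: INFO: execute the command start monitor")
TIMEOUT = re.compile(
    r"lrmd\[\d+\]: warning: pgsqld_monitor_\d+:(\d+) - timed out after (\d+)ms")


def parse_stamp(line, year):
    """Parse the 15-character syslog timestamp; the year is an assumption."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def check_timeouts(log_text, year=2018):
    """Return (pid, seconds since agent start line, timeout in seconds)."""
    starts = {}
    results = []
    for line in log_text.splitlines():
        match = START.search(line)
        if match:
            starts[match.group(1)] = parse_stamp(line, year)
            continue
        match = TIMEOUT.search(line)
        if match and match.group(1) in starts:
            pid, limit_ms = match.group(1), int(match.group(2))
            elapsed = (parse_stamp(line, year) - starts[pid]).total_seconds()
            results.append((pid, elapsed, limit_ms / 1000))
    return results


for pid, elapsed, limit in check_timeouts(LOG):
    print(f"PID {pid}: agent logged its start {elapsed:.0f}s before "
          f"a {limit:.0f}s timeout was reported")
```

For PID 5606 this reports 2 seconds against a 10 second timeout. One possible reading is that the timeout clock was already running before the agent wrote its first log line, for example while the process was waiting to be scheduled, but the log alone does not confirm that.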
> 
> 
> The Cluster Configuration:
> 2 nodes and 13 resources configured
> 
> Online: [ db1 db2 ]
> 
> Full list of resources:
> 
> Clone Set: dlm-clone [dlm]
>      Started: [ db1 db2 ]
> Clone Set: clvmd-clone [clvmd]
>      Started: [ db1 db2 ]
> ipmi_node1     (stonith:fence_ipmilan):        Started db2
> ipmi_node2     (stonith:fence_ipmilan):        Started db1
> Clone Set: clusterfs-clone [clusterfs]
>      Started: [ db1 db2 ]
> Master/Slave Set: pgsql-ha [pgsqld]
>      Masters: [ db1 ]
>      Slaves: [ db2 ]
> Resource Group: mastergroup
>      db1-vip    (ocf::heartbeat:IPaddr2):       Started
>      rep-vip    (ocf::heartbeat:IPaddr2):       Started
> Resource Group: slavegroup
>      db2-vip    (ocf::heartbeat:IPaddr2):       Started
> 
> 
> pcs resource show pgsql-ha
> Master: pgsql-ha
>   Meta Attrs: interleave=true notify=true
>   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>    Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
>    Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
>                stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>                promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
>                demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>                monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
>                monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
>                notify interval=0s timeout=60s (pgsqld-notify-interval-0s)

_______________________________________________
Users mailing list: Users at clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org