[ClusterLabs] Re: pacemaker reports monitor timeout while CPU is high

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed Jan 10 05:19:36 EST 2018


Hi!

I can only speak for myself: back in the HP-UX days we had severe
performance problems once the load was in the range of 8 to 14 (I/O waits
not included, averaged over all logical CPUs), while on Linux we only start
to see problems with a load above 40 or so (I/O waits included, summed over
all logical CPUs, of which we have 24). Also, for us I/O waits cause cluster
timeouts before CPU load actually starts to matter.
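If you want to rule out I/O waits, something like this shows them quickly
(the exact commands are just suggestions):

    vmstat 1         # "wa" column is the iowait percentage
    iostat -x 1      # per-device utilization and await times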
So with a load above 400 (I don't know your number of CPUs) it should not be
that unusual. What is the number of threads on your system at that time?
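Something like the following should give a rough idea (just a suggestion):

    ps -eLf | wc -l      # one line per thread, plus one header line
    cat /proc/loadavg    # 4th field: runnable entities / total entities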
It might be worth the effort to bind the cluster processes to specific CPUs
and keep other tasks away from those, but I have no experience with that.
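Untested, but a systemd drop-in along these lines might be a starting point
(unit name and CPU numbers are only examples):

    # /etc/systemd/system/pacemaker.service.d/cpuaffinity.conf
    [Service]
    # reserve CPUs 0 and 1 for the cluster daemons
    CPUAffinity=0 1

plus the same for corosync.service, while keeping the other workload off
those CPUs (e.g. with cpusets or taskset).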
I guess the "High CPU load detected" message triggers some internal
throttling in the cluster engine (which assumes the cluster engine itself
caused the high load). Of course, for "external" load that measure won't
help...
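If I remember correctly, the threshold behind that message is governed by
the load-threshold cluster property (default 80%), so raising it, e.g.

    pcs property set load-threshold=200%

might reduce the throttling; untested here, and of course it does not fix
the underlying load.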

Regards,
Ulrich


>>> ??? <fanguoteng at highgo.com> wrote on 10.01.2018 at 10:40 in message
<4dc98a5d9be144a78fb9a187217439ed at EX01.highgo.com>:
> Hello,
> 
> This issue only appears when we run a performance test and the CPU load is
> high. The cluster configuration and log are below. Pacemaker restarts the
> slave-side pgsql-ha resource about every two minutes.
> 
> Take the following scenario for example (when the pgsqlms RA is called, we
> print the log “execute the command start (command)”; when the command
> returns, we print the log “execute the command stop (command) (result)”):
> 
> 1. We can see that Pacemaker calls “pgsqlms monitor” about every 15
> seconds, and it returns $OCF_SUCCESS.
> 
> 2. It calls the monitor command again at 13:56:16, and then reports a
> timeout error at 13:56:18. That is only 2 seconds, but it reports
> “timeout=10000ms”.
> 
> 3. In other logs, sometimes after 15 minutes, no “execute the command
> start monitor” is printed and a timeout error is reported directly.
> 
> Could you please tell us how to debug or resolve this issue?
> 
> The log:
> 
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start
> monitor
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop 
> monitor 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start
> monitor
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop 
> monitor 0
> Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU load detected: 
> 426.779999
> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start
> monitor
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID
> 5606) timed out
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed
> out after 10000ms
> Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor operation for
> pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
> Jan 10 13:56:18 sds2 crmd[26096]:  notice: db2-pgsqld_monitor_16000:102 [ 
> /tmp:5432 - accepting connections\n ]
> Jan 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> 
> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor 
> for pgsqld:0 on db2: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for
> pgsqld:1 on db1: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1
> after 1000000 failures (max=1000000)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1
> after 1000000 failures (max=1000000)
> Jan 10 13:56:19 sds2 pengine[26095]:  notice: Recover pgsqld:0#011(Slave 
> db2)
> Jan 10 13:56:19 sds2 pengine[26095]:  notice: Calculated transition 37, 
> saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
> 
> 
> The Cluster Configuration:
> 2 nodes and 13 resources configured
> 
> Online: [ db1 db2 ]
> 
> Full list of resources:
> 
> Clone Set: dlm-clone [dlm]
>      Started: [ db1 db2 ]
> Clone Set: clvmd-clone [clvmd]
>      Started: [ db1 db2 ]
> ipmi_node1     (stonith:fence_ipmilan):        Started db2
> ipmi_node2     (stonith:fence_ipmilan):        Started db1
> Clone Set: clusterfs-clone [clusterfs]
>      Started: [ db1 db2 ]
> Master/Slave Set: pgsql-ha [pgsqld]
>      Masters: [ db1 ]
>      Slaves: [ db2 ]
> Resource Group: mastergroup
>      db1-vip    (ocf::heartbeat:IPaddr2):       Started
>      rep-vip    (ocf::heartbeat:IPaddr2):       Started
> Resource Group: slavegroup
>      db2-vip    (ocf::heartbeat:IPaddr2):       Started
> 
> 
> pcs resource show pgsql-ha
> Master: pgsql-ha
>   Meta Attrs: interleave=true notify=true
>   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>    Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
>    Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
>                stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>                promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
>                demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>                monitor interval=15s role=Master timeout=10s 
> (pgsqld-monitor-interval-15s)
>                monitor interval=16s role=Slave timeout=10s 
> (pgsqld-monitor-interval-16s)
>                notify interval=0s timeout=60s (pgsqld-notify-interval-0s)



