[ClusterLabs] 答复: pacemaker reports monitor timeout while CPU is high

Wed Jan 10 22:50:44 EST 2018

Thank you, Ken.

We have set the timeout to be 10 seconds, but it reports timeout only after 2 seconds. So it seems not work if I set higher timeouts.
Our application which is managed by pacemaker will start more than 500 process to run when running performance test. Does it affect the result? Which log could help us to analyze?

> monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)

-----邮件原件-----
发件人: Ken Gaillot [mailto:kgaillot at redhat.com] 
发送时间: 2018年1月11日 0:54
收件人: Cluster Labs - All topics related to open-source clustering welcomed <users at clusterlabs.org>
主题: Re: [ClusterLabs] pacemaker reports monitor timeout while CPU is high

On Wed, 2018-01-10 at 09:40 +0000, 范国腾 wrote:
> Hello,
>  
> This issue only appears when we run performance test and the CPU is 
> high. The cluster and log is as below. The Pacemaker will restart the 
> Slave Side pgsql-ha resource about every two minutes.
>  
> Take the following scenario for example:（when the pgsqlms RA is 
> called, we print the log “execute the command start (command)”. When 
> the command is returned, we print the log “execute the command stop
> (Command) (result)”）
> 1.     We could see that pacemaker call “pgsqlms monitor” about every
> 15 seconds. And it return $OCF_SUCCESS 2.     In calls monitor command 
> again at 13:56:16, and then it reports timeout error error 13:56:18. 
> It is only 2 seconds but it reports “timeout=10000ms”
> 3.     In other logs, sometimes after 15 minutes, there is no “execute 
> the command start monitor” printed and it reports timeout error 
> directly.
>  
> Could you please tell how to debug or resolve such issue?
>  
> The log:
>  
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command 
> start monitor Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: 
> _confirm_role start Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: 
> _confirm_role stop
> 0
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command 
> stop monitor 0 Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: 
> execute the command start monitor Jan 10 13:55:52 sds2 
> pgsqlms(pgsqld)[5477]: INFO: _confirm_role start Jan 10 13:55:52 sds2 
> pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop
> 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command 
> stop monitor 0 Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU 
> load detected:
> 426.779999
> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command 
> start monitor Jan 10 13:56:18 sds2 lrmd[26093]: warning: 
> pgsqld_monitor_16000 process (PID 5606) timed out

There's something more going on than in this log snippet. Notice the process that timed out (5606) is not one of the processes that logged above (5240 and 5477).

Generally, once load gets that high, it's very difficult to maintain responsiveness, and the expectation is that another node will fence it.
But it can often be worked around with high timeouts, and/or you can use rules to set higher timeouts or maintenance mode during times when high load is expected.

> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606
> - timed out after 10000ms
> Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor operation 
> for pgsqld on db2: Timed Out | call=102
> key=pgsqld_monitor_16000 timeout=10000ms Jan 10 13:56:18 sds2 
> crmd[26096]:  notice: db2-
> pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ] Jan 
> 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> 
> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph Jan 10 13:56:19 sds2 pengine[26095]: 
> warning: Processing failed op monitor for pgsqld:0 on db2: unknown 
> error (1) Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing 
> failed op start for pgsqld:1 on db1: unknown error (1) Jan 10 13:56:19 
> sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 
> 1000000 failures (max=1000000) Jan 10 13:56:19 sds2 pengine[26095]: 
> warning: Forcing pgsql-ha away from db1 after 1000000 failures 
> (max=1000000) Jan 10 13:56:19 sds2 pengine[26095]:  notice: Recover 
> pgsqld:0#011(Slave db2) Jan 10 13:56:19 sds2 pengine[26095]:  notice: 
> Calculated transition 37, saving inputs in 
> /var/lib/pacemaker/pengine/pe-input-1251.bz2
>  
>  
> The Cluster Configuration:
> 2 nodes and 13 resources configured
>  
> Online: [ db1 db2 ]
>  
> Full list of resources:
>  
> Clone Set: dlm-clone [dlm]
>      Started: [ db1 db2 ]
> Clone Set: clvmd-clone [clvmd]
>      Started: [ db1 db2 ]
> ipmi_node1     (stonith:fence_ipmilan):        Started db2
> ipmi_node2     (stonith:fence_ipmilan):        Started db1 Clone Set: 
> clusterfs-clone [clusterfs]
>      Started: [ db1 db2 ]
> Master/Slave Set: pgsql-ha [pgsqld]>
>       Masters: [ db1 ]
> Slaves: [ db2 ]
> Resource Group: mastergroup
>      db1-vip    (ocf::heartbeat:IPaddr2):       Started
>      rep-vip    (ocf::heartbeat:IPaddr2):       Started Resource 
> Group: slavegroup
>      db2-vip    (ocf::heartbeat:IPaddr2):       Started
>  
>  
> pcs resource show pgsql-ha
> Master: pgsql-ha
>   Meta Attrs: interleave=true notify=true
>   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>    Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
>    Operations: start interval=0s timeout=160s (pgsqld-start-interval-
> 0s)
>                stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>                promote interval=0s timeout=130s (pgsqld-promote-
> interval-0s)
>                demote interval=0s timeout=120s (pgsqld-demote-
> interval-0s)
>                monitor interval=15s role=Master timeout=10s (pgsqld-
> monitor-interval-15s)
>                monitor interval=16s role=Slave timeout=10s (pgsqld-
> monitor-interval-16s)
>                notify interval=0s timeout=60s (pgsqld-notify-
> interval-0s)
> _______________________________________________
> Users mailing list: Users at clusterlabs.org 
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started: 
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
--
Ken Gaillot <kgaillot at redhat.com>

_______________________________________________
Users mailing list: Users at clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org