[ClusterLabs] 答复: pacemaker reports monitor timeout while CPU is high

Thu Jan 11 10:48:10 EST 2018

On Thu, 2018-01-11 at 03:50 +0000, 范国腾 wrote:
> Thank you, Ken.
> 
> We have set the timeout to be 10 seconds, but it reports timeout only
> after 2 seconds. So it seems not work if I set higher timeouts.
> Our application which is managed by pacemaker will start more than
> 500 process to run when running performance test. Does it affect the
> result? Which log could help us to analyze?
> 
> > monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-
> > interval-16s)

It's not timing out after 2 seconds. The message:

  sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor

indicates that the monitor's process ID is 5240, but the message:

  sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606)
timed out

indicates that the monitor that timed out had process ID 5606. That
means that there were two separate monitors in progress. I'm not sure
why; I wouldn't expect the second one to be started until after the
first one had timed out. But it's possible with the high load that the
log messages were simply written to the log out of order, since they
were written by different processes.

I would just raise the timeout higher than 10s during the test.

> 
> -----邮件原件-----
> 发件人: Ken Gaillot [mailto:kgaillot at redhat.com] 
> 发送时间: 2018年1月11日 0:54
> 收件人: Cluster Labs - All topics related to open-source clustering
> welcomed <users at clusterlabs.org>
> 主题: Re: [ClusterLabs] pacemaker reports monitor timeout while CPU is
> high
> 
> On Wed, 2018-01-10 at 09:40 +0000, 范国腾 wrote:
> > Hello,
> >  
> > This issue only appears when we run performance test and the CPU
> > is 
> > high. The cluster and log is as below. The Pacemaker will restart
> > the 
> > Slave Side pgsql-ha resource about every two minutes.
> >  
> > Take the following scenario for example:（when the pgsqlms RA is 
> > called, we print the log “execute the command start (command)”.
> > When 
> > the command is returned, we print the log “execute the command stop
> > (Command) (result)”）
> > 1.     We could see that pacemaker call “pgsqlms monitor” about
> > every
> > 15 seconds. And it return $OCF_SUCCESS 2.     In calls monitor
> > command 
> > again at 13:56:16, and then it reports timeout error error
> > 13:56:18. 
> > It is only 2 seconds but it reports “timeout=10000ms”
> > 3.     In other logs, sometimes after 15 minutes, there is no
> > “execute 
> > the command start monitor” printed and it reports timeout error 
> > directly.
> >  
> > Could you please tell how to debug or resolve such issue?
> >  
> > The log:
> >  
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the
> > command 
> > start monitor Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: 
> > _confirm_role start Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]:
> > INFO: 
> > _confirm_role stop
> > 0
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the
> > command 
> > stop monitor 0 Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: 
> > execute the command start monitor Jan 10 13:55:52 sds2 
> > pgsqlms(pgsqld)[5477]: INFO: _confirm_role start Jan 10 13:55:52
> > sds2 
> > pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop
> > 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the
> > command 
> > stop monitor 0 Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU 
> > load detected:
> > 426.779999
> > Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the
> > command 
> > start monitor Jan 10 13:56:18 sds2 lrmd[26093]: warning: 
> > pgsqld_monitor_16000 process (PID 5606) timed out
> 
> There's something more going on than in this log snippet. Notice the
> process that timed out (5606) is not one of the processes that logged
> above (5240 and 5477).
> 
> Generally, once load gets that high, it's very difficult to maintain
> responsiveness, and the expectation is that another node will fence
> it.
> But it can often be worked around with high timeouts, and/or you can
> use rules to set higher timeouts or maintenance mode during times
> when high load is expected.
> 
> > Jan 10 13:56:18 sds2 lrmd[26093]: warning:
> > pgsqld_monitor_16000:5606
> > - timed out after 10000ms
> > Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor
> > operation 
> > for pgsqld on db2: Timed Out | call=102
> > key=pgsqld_monitor_16000 timeout=10000ms Jan 10 13:56:18 sds2 
> > crmd[26096]:  notice: db2-
> > pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
> > Jan 
> > 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> 
> > S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL 
> > origin=abort_transition_graph Jan 10 13:56:19 sds2 pengine[26095]: 
> > warning: Processing failed op monitor for pgsqld:0 on db2: unknown 
> > error (1) Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing 
> > failed op start for pgsqld:1 on db1: unknown error (1) Jan 10
> > 13:56:19 
> > sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 
> > 1000000 failures (max=1000000) Jan 10 13:56:19 sds2
> > pengine[26095]: 
> > warning: Forcing pgsql-ha away from db1 after 1000000 failures 
> > (max=1000000) Jan 10 13:56:19 sds2 pengine[26095]:  notice:
> > Recover 
> > pgsqld:0#011(Slave db2) Jan 10 13:56:19 sds2 pengine[26095]: 
> > notice: 
> > Calculated transition 37, saving inputs in 
> > /var/lib/pacemaker/pengine/pe-input-1251.bz2
> >  
> >  
> > The Cluster Configuration:
> > 2 nodes and 13 resources configured
> >  
> > Online: [ db1 db2 ]
> >  
> > Full list of resources:
> >  
> > Clone Set: dlm-clone [dlm]
> >      Started: [ db1 db2 ]
> > Clone Set: clvmd-clone [clvmd]
> >      Started: [ db1 db2 ]
> > ipmi_node1     (stonith:fence_ipmilan):        Started db2
> > ipmi_node2     (stonith:fence_ipmilan):        Started db1 Clone
> > Set: 
> > clusterfs-clone [clusterfs]
> >      Started: [ db1 db2 ]
> > Master/Slave Set: pgsql-ha [pgsqld]>
> >       Masters: [ db1 ]
> > Slaves: [ db2 ]
> > Resource Group: mastergroup
> >      db1-vip    (ocf::heartbeat:IPaddr2):       Started
> >      rep-vip    (ocf::heartbeat:IPaddr2):       Started Resource 
> > Group: slavegroup
> >      db2-vip    (ocf::heartbeat:IPaddr2):       Started
> >  
> >  
> > pcs resource show pgsql-ha
> > Master: pgsql-ha
> >   Meta Attrs: interleave=true notify=true
> >   Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
> >    Attributes: bindir=/usr/local/pgsql/bin
> > pgdata=/home/postgres/data
> >    Operations: start interval=0s timeout=160s (pgsqld-start-
> > interval-
> > 0s)
> >                stop interval=0s timeout=60s (pgsqld-stop-interval-
> > 0s)
> >                promote interval=0s timeout=130s (pgsqld-promote-
> > interval-0s)
> >                demote interval=0s timeout=120s (pgsqld-demote-
> > interval-0s)
> >                monitor interval=15s role=Master timeout=10s
> > (pgsqld-
> > monitor-interval-15s)
> >                monitor interval=16s role=Slave timeout=10s (pgsqld-
> > monitor-interval-16s)
> >                notify interval=0s timeout=60s (pgsqld-notify-
> > interval-0s)
> > _______________________________________________
> > Users mailing list: Users at clusterlabs.org 
> > http://lists.clusterlabs.org/mailman/listinfo/users
> > 
> > Project Home: http://www.clusterlabs.org Getting started: 
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> > pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> --
> Ken Gaillot <kgaillot at redhat.com>
> 
> _______________________________________________
> Users mailing list: Users at clusterlabs.org http://lists.clusterlabs.or
> g/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot <kgaillot at redhat.com>