[ClusterLabs] pacemaker reports monitor timeout while CPU is high

Wed Jan 10 04:40:51 EST 2018

Hello,

This issue only appears when we run performance test and the CPU is high. The cluster and log is as below. The Pacemaker will restart the Slave Side pgsql-ha resource about every two minutes.

Take the following scenario for example:（when the pgsqlms RA is called, we print the log “execute the command start (command)”. When the command is returned, we print the log “execute the command stop (Command) (result)”）

1.     We could see that pacemaker call “pgsqlms monitor” about every 15 seconds. And it return $OCF_SUCCESS

2.     In calls monitor command again at 13:56:16, and then it reports timeout error error 13:56:18. It is only 2 seconds but it reports “timeout=10000ms”

3.     In other logs, sometimes after 15 minutes, there is no “execute the command start monitor” printed and it reports timeout error directly.

Could you please tell how to debug or resolve such issue?

The log:

Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU load detected: 426.779999
Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms
Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
Jan 10 13:56:18 sds2 crmd[26096]:  notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
Jan 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
Jan 10 13:56:19 sds2 pengine[26095]:  notice: Recover pgsqld:0#011(Slave db2)
Jan 10 13:56:19 sds2 pengine[26095]:  notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2

The Cluster Configuration:
2 nodes and 13 resources configured

Online: [ db1 db2 ]

Full list of resources:

Clone Set: dlm-clone [dlm]
     Started: [ db1 db2 ]
Clone Set: clvmd-clone [clvmd]
     Started: [ db1 db2 ]
ipmi_node1     (stonith:fence_ipmilan):        Started db2
ipmi_node2     (stonith:fence_ipmilan):        Started db1
Clone Set: clusterfs-clone [clusterfs]
     Started: [ db1 db2 ]
Master/Slave Set: pgsql-ha [pgsqld]>

      Masters: [ db1 ]

Slaves: [ db2 ]
Resource Group: mastergroup
     db1-vip    (ocf::heartbeat:IPaddr2):       Started
     rep-vip    (ocf::heartbeat:IPaddr2):       Started
Resource Group: slavegroup
     db2-vip    (ocf::heartbeat:IPaddr2):       Started

pcs resource show pgsql-ha
Master: pgsql-ha
  Meta Attrs: interleave=true notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
   Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
               promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
               demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20180110/88e7c872/attachment-0002.html>