[ClusterLabs] Antw: Re: Antw: pacemaker reports monitor timeout while CPU is high
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Jan 11 03:18:53 EST 2018
Hi!
A few years ago I was playing with cgroups and got quite interesting (useful)
results, but applying the cgroups to existing and newly started processes was
quite hard to integrate into the OS, so I did not pursue it further. I think
cgroups are even more powerful today, but I haven't followed how easy they are
to use on systems based on systemd (which uses cgroups heavily AFAIK).
In short: You may be unable to control the client processes, but you could
control the server processes the clients start.
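As a rough illustration of that idea (assuming cgroup v1 with the libcgroup
tools installed; the group name and values are made up for the example), the
PostgreSQL server processes could be confined like this:

    cgcreate -g cpu:/pgsql                      # dedicated cpu group for the database
    cgset -r cpu.shares=512 pgsql               # halve its weight relative to the default 1024
    cgclassify -g cpu:pgsql $(pidof postgres)   # move the running postmaster and backends into it

Backends forked later inherit the postmaster's group. On systemd-based systems,
something like "systemd-run --scope -p CPUQuota=80% <command>" should give a
similar effect for processes you start yourself.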
Regards,
Ulrich
>>> ??? <fanguoteng at highgo.com> wrote on 11.01.2018 at 05:01 in message
<492a1ace20c04e85bc4979307af2a0be at EX01.highgo.com>:
> Ulrich,
>
> Thank you very much for the help. When we run the performance test, our
> application (pgsql-ha) starts more than 500 processes to handle the client
> requests. Could that be causing this issue?
>
> Is there any workaround or method to keep Pacemaker from restarting the resource
> in such a situation? Right now the system cannot work when the clients send a
> high call load, and we cannot control the clients' behavior.
>
> Thanks
>
>
> -----Original Message-----
> From: Ulrich Windl [mailto:Ulrich.Windl at rz.uni-regensburg.de]
> Sent: 10 January 2018 18:20
> To: users at clusterlabs.org
> Subject: [ClusterLabs] Antw: pacemaker reports monitor timeout while CPU is high
>
> Hi!
>
> I can only speak for myself: in former times with HP-UX, we had severe
> performance problems when the load was in the range of 8 to 14 (I/O waits not
> included, average over all logical CPUs), while on Linux we only get into
> problems with a load above 40 or so (I/O included, summed over all logical
> CPUs, of which there are 24). Also, I/O waits cause cluster timeouts before
> CPU load actually matters (for us).
> So with a load above 400 (not knowing your number of CPUs), such problems should
> not be that unusual. What is the number of threads in your system at that time?
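> (Something like the following one-liner could give that number; just an
> illustration:)
>
>     ps -eo nlwp= | awk '{ n += $1 } END { print n }'   # total threads across all processes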
> It might be worth the effort to bind the cluster processes to specific CPUs
> and keep other tasks away from those, but I don't have experience with that.
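> A minimal sketch of what that pinning could look like (the CPU numbers are
> only illustrative assumptions):
>
>     taskset -pc 0,1 $(pidof corosync)      # pin corosync to CPUs 0-1
>     taskset -pc 0,1 $(pidof pacemakerd)    # pin the pacemaker parent process
>     # children that are already running keep their old affinity; newly forked
>     # children inherit it, so this is best done right after the cluster starts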
> I guess the "High CPU load detected" message triggers some internal suspend
> in the cluster engine (assuming the cluster engine caused the high load). Of
> course, for "external" load that measure won't help...
>
> Regards,
> Ulrich
>
>
>>>> ??? <fanguoteng at highgo.com> wrote on 10.01.2018 at 10:40 in message
> <4dc98a5d9be144a78fb9a187217439ed at EX01.highgo.com>:
>> Hello,
>>
>> This issue only appears when we run a performance test and the CPU load is high.
>> The cluster configuration and log are below. Pacemaker restarts the slave-side
>> pgsql-ha resource about every two minutes.
>>
>> Take the following scenario as an example. (When the pgsqlms RA is called, we
>> print the log “execute the command start (command)”; when the command returns,
>> we print the log “execute the command stop (command) (result)”.)
>>
>> 1. We can see that Pacemaker calls “pgsqlms monitor” about every 15 seconds,
>> and it returns $OCF_SUCCESS.
>>
>> 2. It calls the monitor command again at 13:56:16 and then reports a timeout
>> error at 13:56:18. That is only 2 seconds, but it reports “timeout=10000ms”.
>>
>> 3. In other logs, sometimes after 15 minutes there is no “execute the command
>> start monitor” printed and it reports a timeout error directly.
>>
>> Could you please tell us how to debug or resolve this issue?
>>
>> The log:
>>
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
>> Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.779999
>> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 10000ms
>> Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=10000ms
>> Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
>> Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
>> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 1000000 failures (max=1000000)
>> Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)
>> Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
>>
>>
>> The Cluster Configuration:
>> 2 nodes and 13 resources configured
>>
>> Online: [ db1 db2 ]
>>
>> Full list of resources:
>>
>> Clone Set: dlm-clone [dlm]
>> Started: [ db1 db2 ]
>> Clone Set: clvmd-clone [clvmd]
>> Started: [ db1 db2 ]
>> ipmi_node1 (stonith:fence_ipmilan): Started db2
>> ipmi_node2 (stonith:fence_ipmilan): Started db1
>> Clone Set: clusterfs-clone [clusterfs]
>> Started: [ db1 db2 ]
>> Master/Slave Set: pgsql-ha [pgsqld]
>> Masters: [ db1 ]
>> Slaves: [ db2 ]
>> Resource Group: mastergroup
>> db1-vip (ocf::heartbeat:IPaddr2): Started
>> rep-vip (ocf::heartbeat:IPaddr2): Started
>> Resource Group: slavegroup
>> db2-vip (ocf::heartbeat:IPaddr2): Started
>>
>>
>> pcs resource show pgsql-ha
>> Master: pgsql-ha
>> Meta Attrs: interleave=true notify=true
>> Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
>> Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
>> Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
>> stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
>> promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
>> demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
>> monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
>> monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
>> notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
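>> For reference, both monitor operations above run with timeout=10s. As a purely
>> illustrative workaround while the load problem is investigated (the exact pcs
>> syntax can differ between versions, and 60s is an arbitrary example value),
>> the timeouts could be raised with something like:
>>
>>     pcs resource update pgsqld op monitor interval=15s role=Master timeout=60s \
>>         op monitor interval=16s role=Slave timeout=60s
>>     pcs resource show pgsql-ha   # verify that all operations are still defined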
>
> _______________________________________________
> Users mailing list: Users at clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org