[ClusterLabs] Never join a list without a problem...

Mon Feb 27 19:01:26 EST 2017

On 02/27/2017 01:48 PM, Jeffrey Westgate wrote:
> I think I may be on to something.  It seems that every time my boxes start showing increased host load, the preceding change that takes place is:
> 
>  crmd:     info: throttle_send_command:	New throttle mode: 0100 (was 0000)
> 
> I'm attaching the last 50-odd lines from the corosync.log.  It just happens that, at the moment, the host load on this box is coming back down.  There was no host load issue (0.00 load) immediately preceding this part of the log.
> 
> I know the log shows them in reverse order, but it shows them as the same log item, and printed at the same time.  I'm assuming the throttle change takes place and that increases the load, not the other way around....
> 
> So - what is the throttle mode?
> 
> --
> Jeff Westgate
> DIS UNIX/Linux System Administrator

Actually, it is the other way around. When Pacemaker detects high load
on a node, it "throttles" by reducing the number of operations it will
execute concurrently (to avoid making a bad situation worse). The
throttle mode change in your log is a reaction to the load, not its
cause.
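
To line the throttle changes up against your load graph, a rough sketch
like the following could pull them out of corosync.log. The regular
expression is based only on the log line quoted above. The mapping from
hex values to level names is an assumption recalled from Pacemaker's
crmd throttle code and should be checked against the source for your
version; the function names are made up for illustration.

```python
import re

# ASSUMPTION: bit values as recalled from Pacemaker's crmd throttle code.
# Verify against the source of the Pacemaker version you are running.
THROTTLE_MODES = {
    0x0000: "none",
    0x0001: "low",
    0x0010: "medium",
    0x0100: "high",
    0x1000: "extreme",
}

# Matches lines such as:
#   crmd:  info: throttle_send_command:  New throttle mode: 0100 (was 0000)
THROTTLE_RE = re.compile(
    r"throttle_send_command:\s+New throttle mode: "
    r"([0-9a-fA-F]{4}) \(was ([0-9a-fA-F]{4})\)"
)


def parse_throttle_change(line):
    """Return (new_mode, old_mode) as names, or None if the line does not match."""
    match = THROTTLE_RE.search(line)
    if match is None:
        return None
    new, old = (int(group, 16) for group in match.groups())
    return (THROTTLE_MODES.get(new, "unknown"), THROTTLE_MODES.get(old, "unknown"))


def throttle_changes(lines):
    """Yield (line, new_mode, old_mode) for each throttle change in the given lines."""
    for line in lines:
        parsed = parse_throttle_change(line)
        if parsed is not None:
            yield (line.rstrip("\n"), *parsed)


if __name__ == "__main__":
    sample = " crmd:     info: throttle_send_command:\tNew throttle mode: 0100 (was 0000)"
    print(parse_throttle_change(sample))
```

Passing an open log file to throttle_changes() yields one tuple per
throttle change, and the timestamps on those lines can then be compared
with the times at which Nagios saw the load start to rise.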

So, what caused the load to go up is still a mystery.

There have been some cases where corosync started using 100% CPU, but
since you mentioned that processes aren't taking any more CPU, it
doesn't sound like the same issue.

> ------------------------------
> Date: Mon, 27 Feb 2017 13:26:30 +0000
> From: Jeffrey Westgate <Jeffrey.Westgate at arkansas.gov>
> To: "users at clusterlabs.org" <users at clusterlabs.org>
> Subject: Re: [ClusterLabs] Never join a list without a problem...
> 
> Thanks, Ken.
> 
> Our late guru was the admin who set all this up, and it's been rock solid until recent oddities started cropping up.  They still function fine - they've just developed some... quirks.
> 
> I found the solution before I got your reply, which described essentially what we did: update all but pacemaker, reboot, stop pacemaker, update pacemaker, reboot.  That process was necessary because they've been running so long that pacemaker would not stop.  It would try, then seemingly stall after several minutes.
> 
> We're good now, up-to-date-wise, and stuck only with the initial issue we were hoping to eliminate by updating/patching EVERYthing.  And we honestly don't know what may be causing it.
> 
> We use Nagios to monitor, and once every 20 to 40 hours - sometimes longer, and we cannot set a clock by it - while the machine is 95% idle (or more according to 'top'), the host load shoots up to 50 or 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes to come back down to baseline, which is mostly 0.00.  (attached hostload.pdf)  This happens to both machines, randomly, and is concerning, as we'd like to find what's causing it and resolve it.
> 
> We were hoping it was an "uptime kernel bug", but patching has not helped.  There seems to be no increase in the number of processes running, and the processes running do not take any more CPU time.  They are DNS forwarding resolvers, but there is no correlation between DNS requests and load increase - sometimes (like this morning) it rises around 1 AM when the DNS load is minimal.
> 
> The oddity is - these are the only two boxes with this issue, and we have a couple dozen at the same OS and level.  Only these two, with this role and this particular package set have the issue.
> 
> --
> Jeff