[ClusterLabs] Antw: Re: Never join a list without a problem...

Ulrich Windl Ulrich.Windl at rz.uni-regensburg.de
Wed Mar 1 03:42:53 EST 2017


>>> Jeffrey Westgate <Jeffrey.Westgate at arkansas.gov> schrieb am 27.02.2017 um 14:26
in Nachricht
<A36B14FA9AA67F4E836C0EE59DEA89C4015B20CAB0 at CM-SAS-MBX-07.sas.arkgov.net>:
> Thanks, Ken. 
> 
> Our late guru was the admin who set all this up, and it's been rock solid 
> until recent oddities started cropping up.  They still function fine - they've 
> just developed some... quirks.
> 
> I found the solution before I got your reply, which was essentially what we 
> did; update all but pacemaker, reboot, stop pacemaker, update pacemaker, 
> reboot.  That process was necessary because they've been running sooo long, 
> pacemaker would not stop.  It would try, then seemingly stall after several 
> minutes.
> 
> We're good now, up-to-date-wise, and stuck only with the initial issue we were 
> hoping to eliminate by updating/patching EVERYthing.  And we honestly don't 
> know what may be causing it.
> 
> We use Nagios to monitor, and once every 20 to 40 hours - sometimes longer, 
> and we cannot set a clock by it - while the machine is 95% idle (or more 
> according to 'top'), the host load shoots up to 50 or 60%.  It takes about 20 
> minutes to peak, and another 30 to 45 minutes to come back down to baseline, 
> which is mostly 0.00.  (attached hostload.pdf)  This happens to both 
> machines, randomly, and is concerning, as we'd like to find what's causing it 
> and resolve it.

We use SLES11 here, and it took me a really long time to find out what was causing nightly load peaks on our servers. It turned out to be the rebuild of the manual page database (mandb). It didn't show up in the Nagios load statistics, but it did in monit alerts (on some machines we use both). In monit you can run a script when some condition is met, so I constructed a "capture script" to find the guilty parties ;-)
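The original script isn't shown; a minimal sketch of such a capture script might look like the following (the log path and monit wiring are assumptions, not the poster's actual setup):

```shell
#!/bin/sh
# Minimal "capture script" sketch: append a timestamp and a one-shot
# process snapshot to a log file each time monit fires it.
LOG=/tmp/load-capture.log          # assumed location

{
    date
    top -b -n 1 | head -n 20       # batch mode, single iteration
    echo
} >> "$LOG"
```

In the monit configuration this could be hooked up with an `exec` action inside a `check system` block, e.g. `if cpu (system) > 20% then exec "/usr/local/bin/capture.sh"` (exact syntax depends on the monit version).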

However, the peaks were so short that it took many runs to catch one. In the capture below the load was already back to normal, but monit had reported an event like "cpu system usage of 30.2% matches resource limit [cpu system usage>20.0%]":

Sat May 11 01:31:13 CEST 2013
top - 01:31:14 up 2 days,  9:31,  0 users,  load average: 0.91, 0.31, 0.15
Tasks: 114 total,   2 running, 112 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1065628k total,  1055292k used,    10336k free,   143708k buffers
Swap:  2097148k total,        0k used,  2097148k free,   578736k cached

  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
 2832 root      20   0  8916 1060  776 R      0  0.1   0:00.00 top
 2910 man       30  10     8    4    0 R      0  0.0   0:00.00 mandb
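Since the peaks were nightly, the mandb rebuild was presumably started from cron; one quick way to confirm where such a job is scheduled (the paths here are common defaults, not taken from the poster's systems) is:

```shell
# Look for a mandb job in the usual cron locations (assumed paths).
grep -rn mandb /etc/cron* 2>/dev/null || echo "no mandb cron entry found"
```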

Maybe this helps.

Regards,
Ulrich

> 
> We were hoping "uptime kernel bug", but patching has not helped.  There 
> seems to be no increase in the number of processes running, and the processes 
> running do not take any more cpu time.  They are DNS forwarding resolvers, 
> but there is no correlation between dns requests and load increase - sometimes 
> (like this morning) it rises around 1 AM when the dns load is minimal.
> 
> The oddity is - these are the only two boxes with this issue, and we have a 
> couple dozen at the same OS and level.  Only these two, with this role and 
> this particular package set have the issue.
> 
> --
> Jeff
