[ClusterLabs] Antw: Re: Never join a list without a problem...

Fri Mar 3 02:04:22 EST 2017

>>> Jeffrey Westgate <Jeffrey.Westgate at arkansas.gov> wrote on 02.03.2017 at 17:32
in message
<A36B14FA9AA67F4E836C0EE59DEA89C4015B212CD5 at CM-SAS-MBX-07.sas.arkgov.net>:
> Since we have both pieces of the load-balanced cluster doing the same thing -
> for still-as-yet unidentified reasons - we've put atop on one and sysdig on the
> other.  Running atop at 10-second slices, hoping it will catch something.
> While configuring it yesterday, that server went into its 'episode', but
> there was nothing in the atop log to show anything.  Nothing else changed
> except the cpu load average.  No increase in any other parameter.
> 
> frustrating.

Hi!

You could try the monit approach. I could provide an RPM with a
"recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it.

The part that monitors unusual load looks like this:
  check system host.domain.org
    if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
    if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
    if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
    if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
    if cpu usage > 99% for 15 cycles then alert
    if cpu usage (user) > 90% for 30 cycles then alert
    if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
    group local
### all numbers are a matter of taste ;-)
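
If monit is not available, the three load-average rules above could be approximated from cron. The following is only a rough sketch under assumptions: it reads the Linux-specific /proc/loadavg, the function names load_exceeds and check_load are made up for illustration, and it reuses the 8/4/2 thresholds from the config above. It does not reproduce monit's "for N cycles" logic.

```shell
#!/bin/sh
# Hypothetical stand-in for the monit loadavg rules; not part of monit.

# load_exceeds THRESHOLD VALUE: succeed (exit 0) if VALUE > THRESHOLD.
# awk does the comparison because POSIX sh has no floating-point arithmetic.
load_exceeds() {
    awk -v limit="$1" -v value="$2" 'BEGIN { exit !(value + 0 > limit + 0) }'
}

# check_load [FILE]: print the name of each load average that is over its
# threshold. FILE defaults to /proc/loadavg (Linux only); its first three
# fields are the 1, 5 and 15 minute averages.
check_load() {
    read -r one five fifteen rest < "${1:-/proc/loadavg}"
    load_exceeds 8 "$one"     && echo "1min"
    load_exceeds 4 "$five"    && echo "5min"
    load_exceeds 2 "$fifteen" && echo "15min"
    return 0
}
```

A cron job could then source this file and run something like `check_load | grep -q . && /var/lib/monit/log-top.sh` once a minute.
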
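And my script (for lack of better ideas) looks like this:
#!/bin/sh
# Append a timestamped snapshot of CPU, memory and thread activity to the log.
{
    echo "========== $(/bin/date) =========="
    # Without an interval argument, mpstat and vmstat report averages since boot.
    /usr/bin/mpstat
    echo "---"
    /usr/bin/vmstat
    echo "---"
    # One batch-mode iteration, per thread (-H), hiding idle tasks (-i).
    /usr/bin/top -b -n 1 -Hi
} >> /var/log/monit/top.log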
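
On Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones (state R), so load can rise while the CPUs stay idle. Since top's -i option hides tasks that used no CPU, the snapshot above may not show blocked tasks. A possible addition to the script, as a sketch only: it assumes the procps version of ps, and the function names are invented for illustration.

```shell
#!/bin/sh
# Sketch: list tasks in state D (uninterruptible sleep) or R (runnable),
# the two states that the Linux load average counts.

# Keep only lines whose first field (the task state) starts with D or R.
# The ps header line starts with "S", so it is dropped as well.
filter_load_tasks() {
    awk '$1 ~ /^[DR]/'
}

# Per-thread listing (-L) with state, PID, kernel wait channel and command.
# Assumes procps ps; other ps implementations may not accept these options.
list_load_tasks() {
    ps -eLo state,pid,wchan:32,comm | filter_load_tasks
}
```

Calling `list_load_tasks` inside the braces of the script above would append the listing to the same log.
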

Regards,
Ulrich

> 
> 
> ________________________________________
> From: Adam Spiers [aspiers at suse.com]
> Sent: Wednesday, March 01, 2017 5:33 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Cc: Jeffrey Westgate
> Subject: Re: [ClusterLabs] Never join a list without a problem...
> 
> Ferenc Wágner <wferi at niif.hu> wrote:
>>Jeffrey Westgate <Jeffrey.Westgate at arkansas.gov> writes:
>>
>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>> longer, and we cannot set a clock by it - while the machine is 95%
>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>>> to come back down to baseline, which is mostly 0.00.  (attached
>>> hostload.pdf) This happens to both machines, randomly, and is
>>> concerning, as we'd like to find what's causing it and resolve it.
>>
>>Try running atop (http://www.atoptool.nl/).  It collects and logs
>>process accounting info, allowing you to step back in time and check
>>resource usage in the past.
> 
> Nice, I didn't know atop could also log the collected data for future
> analysis.
> 
> If you want to capture even more detail, sysdig is superb:
> 
>     http://www.sysdig.org/ 
> 