[ClusterLabs] Antw: Re: Never join a list without a problem...
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Fri Mar 3 02:04:22 EST 2017
>>> Jeffrey Westgate <Jeffrey.Westgate at arkansas.gov> wrote on 02.03.2017 at
17:32 in message
<A36B14FA9AA67F4E836C0EE59DEA89C4015B212CD5 at CM-SAS-MBX-07.sas.arkgov.net>:
> Since we have both pieces of the load-balanced cluster doing the same thing -
> for still-as-yet unidentified reasons - we've put atop on one and sysdig on the
> other. Running atop at 10 second slices, hoping it will catch something.
> While configuring it yesterday, that server went into its 'episode', but
> there was nothing in the atop log to show anything. Nothing else changed
> except the cpu load average. No increase in any other parameter.
>
> frustrating.
Hi!
You could try the monit approach (I could provide an RPM with a
"recent-enough" monit compiled for SLES11 SP4 (x86-64) if you need it).
The part that monitors unusual load looks like this:
check system host.domain.org
    if loadavg (1min) > 8 then exec "/var/lib/monit/log-top.sh"
    if loadavg (5min) > 4 then exec "/var/lib/monit/log-top.sh"
    if loadavg (15min) > 2 then exec "/var/lib/monit/log-top.sh"
    if memory usage > 90% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 25% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if swap usage > 50% then exec "/var/lib/monit/log-top.sh"
    if cpu usage > 99% for 15 cycles then alert
    if cpu usage (user) > 90% for 30 cycles then alert
    if cpu usage (system) > 20% for 2 cycles then exec "/var/lib/monit/log-top.sh"
    if cpu usage (wait) > 80% then exec "/var/lib/monit/log-top.sh"
    group local
### all numbers are a matter of taste ;-)
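To try out thresholds like these before committing them to monit, the
load-average rules can be approximated by hand. This is only a rough sketch,
not how monit itself evaluates them: the function names load_exceeds and
check_loadavg are made up for illustration, and reading /proc/loadavg
assumes Linux.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when load average $1 is
# strictly above threshold $2. awk does the floating-point comparison.
load_exceeds() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load + 0 > limit + 0) }'
}

# Hypothetical helper: succeeds when any of the 1-, 5- or 15-minute load
# averages is above the example thresholds 8, 4 and 2.
check_loadavg() {
    # /proc/loadavg is Linux-specific: "1min 5min 15min running/total lastpid"
    read -r one five fifteen rest < /proc/loadavg || return 2
    load_exceeds "$one" 8 || load_exceeds "$five" 4 || load_exceeds "$fifteen" 2
}
```

For example, "check_loadavg && echo over" prints "over" only while one of
the three averages is above its limit.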
And my script (for lack of better ideas) looks like this:
#!/bin/sh
{
    echo "========== $(/bin/date) =========="
    /usr/bin/mpstat
    echo "---"
    /usr/bin/vmstat
    echo "---"
    /usr/bin/top -b -n 1 -Hi
} >> /var/log/monit/top.log
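One possible variation, offered as a sketch under assumptions rather than a
tested replacement: vmstat without arguments reports averages since boot, so
asking for two samples gives one that covers only the last second. On Linux,
tasks in uninterruptible sleep (state D) count toward the load average
without using CPU, which may matter when load rises on an otherwise idle
machine, so the sketch lists those too. The names LOG_FILE and log_snapshot
are illustrative, and the ps column list assumes the procps version of ps.

```shell
#!/bin/sh
# Illustrative names: LOG_FILE and log_snapshot are not part of monit.
LOG_FILE="${LOG_FILE:-/var/log/monit/top.log}"

log_snapshot() {
    {
        echo "========== $(date) =========="
        # Load averages plus running/total task counts (Linux only).
        if [ -r /proc/loadavg ]; then cat /proc/loadavg; fi
        echo "---"
        # Two samples one second apart: the first is the average since
        # boot, the second covers only the last second.
        if command -v vmstat >/dev/null 2>&1; then vmstat 1 2; fi
        echo "---"
        # Header line plus every task currently in state D
        # (uninterruptible sleep); wchan hints at what it is waiting on.
        ps -eo state,pid,ppid,wchan:32,comm 2>/dev/null |
            awk 'NR == 1 || $1 ~ /^D/'
    } >> "$LOG_FILE"
}
```

Calling log_snapshot at the end of the file would make it usable as the
exec target of the monit rules.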
Regards,
Ulrich
>
>
> ________________________________________
> From: Adam Spiers [aspiers at suse.com]
> Sent: Wednesday, March 01, 2017 5:33 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Cc: Jeffrey Westgate
> Subject: Re: [ClusterLabs] Never join a list without a problem...
>
> Ferenc Wágner <wferi at niif.hu> wrote:
>>Jeffrey Westgate <Jeffrey.Westgate at arkansas.gov> writes:
>>
>>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>>> longer, and we cannot set a clock by it - while the machine is 95%
>>> idle (or more according to 'top'), the host load shoots up to 50 or
>>> 60%. It takes about 20 minutes to peak, and another 30 to 45 minutes
>>> to come back down to baseline, which is mostly 0.00. (attached
>>> hostload.pdf) This happens to both machines, randomly, and is
>>> concerning, as we'd like to find what's causing it and resolve it.
>>
>>Try running atop (http://www.atoptool.nl/). It collects and logs
>>process accounting info, allowing you to step back in time and check
>>resource usage in the past.
>
> Nice, I didn't know atop could also log the collected data for future
> analysis.
>
> If you want to capture even more detail, sysdig is superb:
>
> http://www.sysdig.org/