[ClusterLabs] Never join a list without a problem...

Thu Mar 2 16:32:02 UTC 2017

Since both machines in the load-balanced cluster are doing the same thing - for as-yet unidentified reasons - we've put atop on one and sysdig on the other.  atop is running at 10-second intervals, in the hope that it will catch something.  While I was configuring it yesterday, that server went into its 'episode', but nothing in the atop log showed anything unusual.  Nothing changed except the CPU load average; no other parameter increased.

Frustrating.
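One thing that might be worth checking alongside atop: on Linux the load average counts tasks in uninterruptible sleep (state 'D') as well as runnable ones, so it can climb while the CPU stays idle.  Below is a rough sketch, not a tested tool, that samples /proc/loadavg and lists any 'D'-state processes when the 1-minute load crosses a threshold.  The function names, the threshold of 1.0, and the 10-second interval (chosen to match the atop slices) are my own assumptions, and it only looks at processes, not individual threads.

```python
import os
import time


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_stat_state(stat_line):
    """Return (command, state) from one /proc/<pid>/stat line.

    The command is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    head, _, tail = stat_line.rpartition(")")
    command = head.split("(", 1)[1]
    state = tail.split()[0]
    return command, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, command) for processes currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                command, state = parse_stat_state(handle.read())
        except (OSError, IndexError):
            continue  # process exited (or was unreadable) while we scanned
        if state == "D":
            blocked.append((int(entry), command))
    return blocked


def watch(threshold=1.0, interval=10, samples=360):
    """Print D-state processes whenever the 1-minute load exceeds threshold.

    threshold, interval and samples are illustrative defaults (assumptions),
    not values taken from this thread.
    """
    for _ in range(samples):
        with open("/proc/loadavg") as handle:
            one_minute, _, _ = parse_loadavg(handle.read())
        if one_minute > threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(stamp, "load", one_minute, "blocked:", blocked_tasks())
        time.sleep(interval)
```

Calling watch() on the affected host during an episode would show whether any processes are stuck in 'D' at the time; an empty list would point away from blocked I/O as the cause, at least at the process level.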

________________________________________
From: Adam Spiers [aspiers at suse.com]
Sent: Wednesday, March 01, 2017 5:33 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Cc: Jeffrey Westgate
Subject: Re: [ClusterLabs] Never join a list without a problem...

Ferenc Wágner <wferi at niif.hu> wrote:
>Jeffrey Westgate <Jeffrey.Westgate at arkansas.gov> writes:
>
>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>> longer, and we cannot set a clock by it - while the machine is 95%
>> idle (or more according to 'top'), the host load shoots up to 50 or
>> 60%.  It takes about 20 minutes to peak, and another 30 to 45 minutes
>> to come back down to baseline, which is mostly 0.00.  (attached
>> hostload.pdf) This happens to both machines, randomly, and is
>> concerning, as we'd like to find what's causing it and resolve it.
>
>Try running atop (http://www.atoptool.nl/).  It collects and logs
>process accounting info, allowing you to step back in time and check
>resource usage in the past.

Nice, I didn't know atop could also log the collected data for future
analysis.

If you want to capture even more detail, sysdig is superb:

    http://www.sysdig.org/