[ClusterLabs] Antw: Antw: Re: Never join a list without a problem...

Thu Mar 9 08:57:54 CET 2017

>>> Jeffrey Westgate <Jeffrey.Westgate at arkansas.gov> wrote on 08.03.2017 at 16:58
in message
<A36B14FA9AA67F4E836C0EE59DEA89C4015B217056 at CM-SAS-MBX-07.sas.arkgov.net>:
> Ok. 
> 
> Been running monit for a few days, and atop (running a script to capture an 
> atop output every 10 seconds for an hour, rotate the log, and do it again; 
> runs from midnight to midnight, changes the date, and does it again).  I 
> correlate between the atop logs, nagios alerts, and monit, to try to find a 
> trigger.  Like trying to find a particular snowflake in Alaska in January.
> 
> Have had a handful of episodes with all the monitors running.  We have 
> determined nothing. Nothing significantly changes from normal/regular to high 
> host load.

What does that mean? Are you detecting high load but no culprit, or are you failing to detect the high load at all?
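One way to tell those two cases apart is to line up each alert time against whatever load samples you captured locally. Below is a minimal Python sketch of that idea, not a finished tool. The log format (one "<ISO timestamp> <1-minute load>" pair per line), the function names, the 60-second window and the threshold are all assumptions for illustration; atop, nagios and monit each log in their own formats, so the parsing would have to be adapted.

```python
from datetime import datetime, timedelta


def parse_sample(line):
    """Parse one hypothetical log line: '<ISO timestamp> <1-minute load>'."""
    stamp, load = line.split()
    return datetime.fromisoformat(stamp), float(load)


def peak_load_near(samples, alert_time, window=timedelta(seconds=60)):
    """Return the highest load within +/- window of alert_time, or None."""
    nearby = [load for when, load in samples if abs(when - alert_time) <= window]
    return max(nearby) if nearby else None


def classify(samples, alert_time, threshold):
    """Say whether the local samples confirm the externally reported episode."""
    peak = peak_load_near(samples, alert_time)
    if peak is None:
        return "no samples near alert"
    if peak >= threshold:
        return "load seen locally"
    return "load not seen locally"
```

If most alerts come back as "load seen locally", the monitoring works and the open question is the culprit. If they come back as "load not seen locally", the sampling interval or the metric itself is probably missing the episode.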

> 
> It's a VMWare/ESXi-hosted VM, so we moved it to a different host and 
> different datastore (so, effectively new CPU, memory, nic, disk, video... 
> basically all "new" hardware).  Still have episodes.

Unless some driver is cheating on I/O, a slow device should also show up as high load.
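The reason is that the Linux load average counts tasks in uninterruptible sleep (state D, typically waiting on I/O) as well as runnable ones. A rough Python sketch for catching such tasks during an episode follows. The function names are made up for illustration, and it only looks at each process's main thread in /proc/<pid>/stat, so a blocked secondary thread or a very short stall between two runs would be missed.

```python
import glob


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    left, right = text.rsplit(")", 1)
    pid, comm = left.split(" (", 1)
    return int(pid), comm, right.split()[0]


def uninterruptible_tasks():
    """List (pid, comm) of processes whose main thread is in state D."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError):
            continue  # process exited (or stat was unreadable) mid-scan
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Run in a tight loop while the load is climbing, this would at least show whether the extra load comes from tasks stuck in D state or from runnable ones.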

> 
> Was running the "VMWare provided" vmtools.  removed and replaced with 
> open-vm-tools this morning.  just had another episode.

Have you tried without? I know VMware demands them, but...
And if you are running them, can't you also monitor the performance from vCenter (or so)?

> 
> was running atop interactively when the episode started - the only thing that 
> seems to change is the hostload goes up.  momentary spike in "avio" for the 
> disk -- all the way up to 25 msecs. lasted for one ten-second slice from atop.

Your VMware isn't migrating VMs for fun, is it?

> 
> no zombies, no wait, no spike in network, transport, mem use, disk 
> reads/writes... nothing I can see (and by I, I mean "we" as we have three 
> people looking)
> 
> I've got other boxes running the same OS - updated them at the same time, so 
> patch level is all same.  No similar issues.  The only thing I have different 
> is these two are running pacemaker, corosync, keepalived.  maybe when they 
> were updated, they need a library I don't have? 
> 
> running     /usr/sbin/iotop -obtqqq > /var/log/iotop.log -- no red flags there.  
> so - not OS, not IO, not hardware (virtual as it is...) ... only leaves 
> software.

Hmm...

> 
> Maybe pacemaker is just incompatible with:
> 
> Scientific Linux release 6.5 (Carbon)
> kernel  2.6.32-642.15.1.el6.x86_64

Oh, from which museum is that? ;-) The SLES11 SP4 kernel is also quite old, but it is 3.0.101 at least...

> 
> ??
> 
> At this point it's more of a curiosity than an out and out problem, as 
> performance does not seem to be impacted noticeably.  Packet-in, packet-out 
> seems unperturbed. Same cannot be said for us administrators...

If it's not against your political convictions, you could try one of these: openSUSE Leap 42.2 (free), SLES11 SP4 (eval license), SLES 12 SP2 (eval license). The SLES eval license gives you free updates for 60 days (no other limit known), and possibly a free call from the sales representative afterwards ;-) Note that plain SLES does not include the cluster stuff (it's in the HA component, licensed separately). For simplicity you could use the "SLES for SAP applications" variant that is bundled with the HA stuff (that's what we use), and you can do a "SAP-free installation" of course ;-)

Regards,
Ulrich