[Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

Fri Aug 12 05:58:54 EDT 2011

Hi Steven,

On 12.08.2011, at 02:11, Steven Dake wrote:
>> We've had another one of these this morning:
>> "Process pause detected for 11763 ms, flushing membership messages."
>> According to the graphs that are generated from Nagios data, the load of that system 
>> jumped from 1.0 to 5.1 about 2 minutes before this event, stayed at that value for 
>> ~5 minutes, then dropped to below 1 afterwards. 10 minutes later the system got shot,
> 
> Did nagios possibly block for 10+ seconds during this time as well?  In
> this case, it wouldn't detect any spikes or delays in scheduling.

Nagios is not running on that machine; it runs on a dedicated server and
monitors our servers through a client-side component (NRPE). 
These checks run at a scheduled interval of 5 minutes, so a more detailed view 
of that particular time frame is not available.  
In particular, I can't tell whether multiple/all processes were frozen for 
10+ seconds or only the Corosync process.
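
To close that gap, something much finer-grained than a 5-minute NRPE check 
would be needed. A minimal sketch of such a watchdog in Python follows; it is 
my own illustration, not Corosync's detection logic, and the name watch() and 
its parameters are assumptions. It sleeps for a short interval and reports 
whenever the wake-up is much later than requested. If it logged a stall at the 
same moment as Corosync's "Process pause detected", that would suggest the 
whole system was frozen rather than Corosync alone. A script at normal priority 
can also be delayed by plain load, so its output would only be a hint.

```python
import time


def watch(duration, interval=0.1, threshold=1.0,
          clock=time.monotonic, sleep=time.sleep):
    """Sleep in short steps for `duration` seconds and record late wake-ups.

    Returns a list of (seconds_since_start, overshoot_seconds) tuples, one
    for every wake-up that was more than `threshold` seconds late.
    `clock` and `sleep` are injectable so the logic can be tested.
    """
    pauses = []
    start = last = clock()
    while last - start < duration:
        sleep(interval)
        now = clock()
        # How much longer than requested did this sleep take?
        overshoot = (now - last) - interval
        if overshoot > threshold:
            pauses.append((now - start, overshoot))
            print(f"paused for about {overshoot:.3f} s "
                  f"at t+{now - start:.1f} s", flush=True)
        last = now
    return pauses
```

Calling watch(3600), for example, would observe one hour. The monotonic clock 
is used so that NTP adjustments of the wall clock are not mistaken for pauses.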

> Are you running in a virtual machine or on old/slow hardware?

I'm afraid not. It's an IBM x3650 M3 machine with 2x Xeon E5620 quad-core CPUs
and 32 GB RAM. The machine is mirroring two of its disk arrays via DRBD to an 
identical twin forming an Active/Standby setup. The active machine runs a 
MySQL server on XFS (which is only mounted on the active node) and both machines
share a large amount of small files on an OCFS2 filesystem.
I originally planned to use the standby node for jobs that access the shared 
data files but don't need direct access to the database files. However, since 
the whole setup has not been very stable so far, the standby machine does 
not run any such jobs except for the daily backup of the shared data files (which 
only needs to be done on one of the nodes).

> RE deadline cpu scheduler, the only thing I can find about that topic is
> a new scheduling class.  Corosync doesn't take advantage of that
> scheduling class (it's not in the linux 3.0 glibc man pages - if it is
> there, we don't know how to use it).

Sorry, I think I mixed that up #-/
The deadline scheduler is used for I/O operations, not as a CPU scheduler.
So it shouldn't matter in this case.
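
For reference, the active I/O scheduler of each block device can be read from 
sysfs, where the kernel marks the one in use with square brackets (for example 
"noop [deadline] cfq"). A small sketch in Python; the helper names are my own 
and purely illustrative:

```python
import re
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed entry from a sysfs scheduler line, or None."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def io_schedulers(sysfs_root="/sys/block"):
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for path in Path(sysfs_root).glob("*/queue/scheduler"):
        # path is <root>/<device>/queue/scheduler
        device = path.parent.parent.name
        result[device] = active_scheduler(path.read_text())
    return result
```

Devices that expose no bracketed entry are reported as None.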

> I would really like someone that has these process pause problems to
> test a patch I have posted to see if it rectifies the situation.  Our
> significant QE team at Red Hat doesn't see these problems and I can't
> generate them in engineering.  It is possible your device drivers are
> taking spinlocks for extended periods or some other kernel problem is
> occurring.
> 
> If you feel up to the task of building your own corosync, try out this
> patch:
> 
> http://marc.info/?l=openais&m=130989380207300&w=2

I'd love to test this, but it'll take a few weeks. 
The machines are already in production and we don't have comparable test machines.
I'm currently (actually ;) taking a few days off, and when I'm back at the office, 
I'll update the Corosync version to v1.4.1 (because of the retransmit list 
problem) -- does the patch apply cleanly to v1.4.1?
I'll then need to schedule some downtime for the update, and then we'll have to wait 
and see what happens, since we've had around 12 days of stable operation twice in a row 
(yay!).
If you still want to reproduce the problem, I could get you all sorts of details 
regarding our setup.

-- 
Sebastian