[Pacemaker] TOTEM: Process pause detected? Leading to STONITH...

Thu Aug 11 06:05:53 EDT 2011

Hi,

On 04.08.2011, at 18:21, Steven Dake wrote:

>> Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause detected
>> for 11149 ms, flushing membership messages.
> 
> This process pause message indicates the scheduler doesn't schedule
> corosync for 11 seconds which is greater than the failure detection
> timeouts.  What does your config file look like?  What load are you running?

We've had another one of these this morning:
"Process pause detected for 11763 ms, flushing membership messages."
According to the graphs generated from Nagios data, the load on that system
jumped from 1.0 to 5.1 about 2 minutes before this event, stayed at that value for
~5 minutes, then dropped below 1. 10 minutes later the system got shot,
probably because OCFS2 got confused by the node leaving the cluster.
At that time, the machine was only the standby node. The only things that could
have been running then are a daily backup run (TSM), which starts the night before
and takes a few hours to complete, and the OCFS2-related processes (the backup of
the OCFS2 filesystem is done on that machine).

What can I do to investigate this behavior? We switched to the "deadline" I/O
scheduler before the July 31st event. Could this cause this kind of behavior?
I was under the impression that "deadline" was designed to prevent exactly these
kinds of situations.
Further increasing the timeout above the current value of 10s doesn't look like
a solution to this problem.
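
One possible first step would be to pull every pause message out of the corosync
log and see by how much each pause exceeds the timeout. A minimal sketch in Python,
assuming the message format quoted above and taking the current 10s timeout as
10000 ms; the function name find_pauses and its timeout_ms parameter are
hypothetical, not part of corosync or Pacemaker:

```python
import re

# Message format as it appears in the log lines quoted in this thread.
PAUSE_RE = re.compile(r"Process pause detected for (\d+) ms")


def find_pauses(lines, timeout_ms=10000):
    """Return (pause_ms, excess_ms, line) for each pause message found.

    timeout_ms is an assumption: the 10s timeout mentioned in this mail.
    excess_ms is negative when the pause stayed below the timeout.
    """
    hits = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match:
            pause_ms = int(match.group(1))
            hits.append((pause_ms, pause_ms - timeout_ms, line.strip()))
    return hits


# The two messages reported so far; the second one is given without a
# timestamp because only the message text was quoted.
sample = [
    "Jul 31 03:51:02 node01 corosync[5870]:  [TOTEM ] Process pause detected "
    "for 11149 ms, flushing membership messages.",
    "Process pause detected for 11763 ms, flushing membership messages.",
]

for pause_ms, excess_ms, _ in find_pauses(sample):
    print(f"pause {pause_ms} ms, {excess_ms} ms over the timeout")
```

Fed with the real log file instead of the sample list, this would show whether
shorter pauses also occur that stay just below the timeout and therefore go
unnoticed.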

The configuration is unchanged from the one I posted on August 4th.
The funny thing is that the cluster had not shown any problems since July 31st.

Thanks in advance!

-- 
Sebastian