[Pacemaker] Help with OCFS2 / DLM Stability

Wed Mar 10 10:34:43 EST 2010

On Wed, 2010-03-10 at 13:28 +0100, Dejan Muhamedagic wrote: 

	Hi,

	On Tue, Mar 09, 2010 at 11:37:02AM -0000, Darren.Mansell at opengi.co.uk wrote:
	> Hi everyone.
	>
	> Further to some discussions a couple of weeks ago with regard to OCFS2
	> on SLES 11 HAE I'm looking to finally nail this problem.
	> 
	> We have a 3 node cluster that has a STONITH shootout every week. This
	> morning one node got stuck in a state where it couldn't be fenced due
	> to the RSA not being responsive.
	> 
	> I'm not sure if the problem is due to:
	> 
	> *         Network interruption causing Totem failures.
	> *         Java (Tomcat) processes falling over.

	I suppose that those are activequote and activequoteadmin. You
	should increase the timeouts, 10 seconds is too short in general,
	and for java/tomcat probably even more so.

I've increased those. As the monitor operation in the LSB script is just a pgrep, I don't think it matters that the monitor interval is 10s while the timeout is 30s. Is this correct?
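
For reference, a minimal sketch of how such a monitor operation might be expressed in crm shell syntax. The primitive id activequote is taken from the log line quoted further down; the LSB script name lsb:activequote and the values shown are assumptions, not this cluster's actual configuration.

```shell
# Sketch only: inspect the current definition of the resource first.
crm configure show activequote

# After `crm configure edit activequote`, the definition might read as
# follows (lsb:activequote is an assumed script name):
#
#   primitive activequote lsb:activequote \
#       op monitor interval="10s" timeout="30s"
#
# interval: how often the monitor operation is scheduled
# timeout:  how long a single run may take before it counts as failed
```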

	> *         DLM falling over.
	> *         Any of the above in any combination.
	> 
	> I've attached a hb_report. Could you see if you can see anything?

	Any good reason to ignore quorum? For a three node cluster you
	should remove the no-quorum-policy property or, perhaps because
	of ocfs2, set it to freeze.

Oops. It was originally a 2 node cluster. When the 3rd node was added, that property was obviously missed.
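
A short sketch of the change suggested above, in crm shell syntax. With three nodes, a partition needs two votes to keep quorum, so ignoring quorum is no longer necessary. Choosing freeze rather than removing the property is an assumption based on the OCFS2 remark above.

```shell
# Sketch only: on loss of quorum, freeze resources instead of ignoring quorum.
crm configure property no-quorum-policy="freeze"

# Confirm what is now set in the cluster configuration.
crm configure show | grep no-quorum-policy

# To fall back to the default policy (stop) instead, delete the
# no-quorum-policy entry in `crm configure edit`.
```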

	Pacemaker is 1.0.3, perhaps it's time to upgrade too. There is a
	SLE11 HAE update available.

	From the logs:

	Mar  9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op: Processing failed op activequote:1_monitor_10000 on OGG-ACTIVEQUOTE-03: unknown exec error

	Interestingly, there is no lrmd log for this on 03.

	Then there are several operation timeouts, perhaps due to ocfs2
	hanging. Two activequote and activequoteadmin stop operations
	could not be killed even with -9, so they were probably waiting
	for the disk.
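
Processes that survive kill -9 are typically in uninterruptible sleep (state D). A minimal sketch for listing them, assuming a Linux node with procps ps; the function name list_dstate is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Sketch only: print the ps header plus every process whose state starts
# with D (uninterruptible sleep, usually waiting on I/O).
list_dstate() {
    # wchan shows the kernel function the process is sleeping in, which
    # can hint at whether it is blocked in the filesystem or DLM code.
    ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
}

list_dstate
```

If the stuck stop operations show up here with a filesystem-related wchan, that would support the theory that they were waiting for the disk.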

	Mar  9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm  ] info: pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642

	Do you know why the node vanished? You should try to keep your
	networking healthy.

This is amazingly accurate. It turns out the datacentre had some scheduled maintenance we weren't aware of, during which the network cable was pulled out, causing this. Case solved, although it doesn't explain what happened on previous occasions.

	Thanks,

	Dejan

Thanks for your help!

Darren
