[Pacemaker] Help with OCFS2 / DLM Stability

Darren.Mansell at opengi.co.uk Darren.Mansell at opengi.co.uk
Wed Mar 10 08:54:56 EST 2010


Sorry, please ignore this mail. Client issues!


-----Original Message-----
From: Darren.Mansell at opengi.co.uk [mailto:Darren.Mansell at opengi.co.uk] 
Sent: 10 March 2010 13:53
To: dejanmm at fastmail.fm
Cc: pacemaker at oss.clusterlabs.org
Subject: Re: Re: [Pacemaker] Help with OCFS2 / DLM Stability

On Wed, 2010-03-10 at 13:28 +0100, Dejan Muhamedagic wrote:
> Hi,
>
> On Tue, Mar 09, 2010 at 11:37:02AM -0000, Darren.Mansell at opengi.co.uk wrote:
> > Hi everyone.
> >
> > Further to some discussions a couple of weeks ago with regard to
> > OCFS2 on SLES 11 HAE I'm looking to finally nail this problem.
> >
> > We have a 3 node cluster that has a STONITH shootout every week.
> > This morning one node got stuck in a state where it couldn't be
> > fenced due to the RSA not being responsive.
> >
> > I'm not sure if the problem is due to:
> >=20
> > *         Network interruption causing Totem failures.
> > *         Java (Tomcat) processes falling over.
>
> I suppose that those are activequote and activequoteadmin. You should
> increase the timeouts, 10 seconds is too short in general, and for
> java/tomcat probably even more so.
>
> > *         DLM falling over.
> > *         Any of the above in any combination.
> >
> > I've attached a hb_report. Could you see if you can see anything?
>
> Any good reason to ignore quorum? For a three node cluster you should
> remove the no-quorum-policy property or, perhaps because of ocfs2, set
> it to freeze.
>
> Pacemaker is 1.0.3, perhaps it's time to upgrade too. There is a
> SLE11 HAE update available.
>
> From the logs:
>
> Mar  9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op: Processing failed op activequote:1_monitor_10000 on OGG-ACTIVEQUOTE-03: unknown exec error
>
> Interestingly, there is no lrmd log for this on 03.
>
> Then there are several operation timeouts, perhaps due to ocfs2
> hanging, two activequote and activequoteadmin stop operations could
> not be killed even with -9, so they were probably waiting for the
> disk.
>
> Mar  9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm  ] info: pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642
>
> Do you know why the node vanished? You should try to keep your
> networking healthy.
>
> Thanks,
>
> Dejan
>
> > Thanks
> >
> > Darren Mansell
> >
> > _______________________________________________
> > Pacemaker mailing list
> > Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker




