[Pacemaker] Help with OCFS2 / DLM Stability
    Darren.Mansell at opengi.co.uk 
       
    Wed Mar 10 13:52:47 UTC 2010
    
    
  
On Wed, 2010-03-10 at 13:28 +0100, Dejan Muhamedagic wrote:
> Hi,
>
> On Tue, Mar 09, 2010 at 11:37:02AM -0000, Darren.Mansell at opengi.co.uk wrote:
> > Hi everyone.
> >
> > Further to some discussions a couple of weeks ago with regard to OCFS2
> > on SLES 11 HAE I'm looking to finally nail this problem.
> >
> > We have a 3 node cluster that has a STONITH shootout every week. This
> > morning one node got stuck in a state where it couldn't be fenced due
> > to the RSA not being responsive.
> >
> > I'm not sure if the problem is due to:
> >
> > * Network interruption causing Totem failures.
> > * Java (Tomcat) processes falling over.
>
> I suppose that those are activequote and activequoteadmin. You
> should increase the timeouts: 10 seconds is too short in general,
> and for Java/Tomcat probably even more so.
>
> > * DLM falling over.
> > * Any of the above in any combination.
> >
> > I've attached a hb_report. Could you see if you can see anything?
>
> Any good reason to ignore quorum? For a three node cluster you
> should remove the no-quorum-policy property or, perhaps because
> of ocfs2, set it to freeze.
>
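
To make the quorum point concrete, here is a minimal sketch of plain majority arithmetic. It assumes one vote per node and a simple majority rule; the function name has_quorum is made up for illustration and is not part of Pacemaker or OpenAIS.

```python
def has_quorum(total_nodes: int, live_nodes: int) -> bool:
    """Return True when a partition holds a strict majority of the votes.

    Assumption: every node carries exactly one vote.
    """
    return 2 * live_nodes > total_nodes


if __name__ == "__main__":
    # Three-node cluster, one node lost: the remaining pair still has quorum.
    print(has_quorum(3, 2))
    # Three-node cluster, a single isolated node: no quorum.
    print(has_quorum(3, 1))
    # Two-node cluster, one node lost: nobody has a majority.
    print(has_quorum(2, 1))
```

With three nodes the surviving pair keeps quorum after one node is lost, so there is little reason to ignore it. Only in a two-node cluster does a single failure leave nobody with a majority.
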
> Pacemaker is 1.0.3, perhaps it's time to upgrade too. There is a
> SLE11 HAE update available.
>
> From the logs:
>
> Mar  9 06:28:43 OGG-ACTIVEQUOTE-02 pengine: [5540]: WARN: unpack_rsc_op: Processing failed op activequote:1_monitor_10000 on OGG-ACTIVEQUOTE-03: unknown exec error
>
> Interestingly, there is no lrmd log for this on 03.
>
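
The operation key in the pengine warning above packs the resource instance, the operation and the interval in milliseconds into one string. Below is a small sketch for pulling those apart, assuming the usual <resource>_<operation>_<interval> layout. parse_op_key and OpKey are hypothetical helper names, and operations whose own names contain an underscore (migrate_to, for example) are not handled.

```python
import re
from typing import NamedTuple


class OpKey(NamedTuple):
    resource: str      # resource instance, e.g. a clone instance "name:1"
    operation: str     # operation name, e.g. "monitor" or "stop"
    interval_ms: int   # recurring interval in milliseconds (0 = one-off)


# Greedy resource part, then "_<operation>_<digits>" anchored at the end.
_OP_KEY = re.compile(r"^(?P<rsc>.+)_(?P<op>[^_]+)_(?P<interval>\d+)$")


def parse_op_key(key: str) -> OpKey:
    """Split an operation key such as 'activequote:1_monitor_10000'."""
    match = _OP_KEY.match(key)
    if match is None:
        raise ValueError(f"not an operation key: {key!r}")
    return OpKey(match["rsc"], match["op"], int(match["interval"]))


if __name__ == "__main__":
    key = parse_op_key("activequote:1_monitor_10000")
    print(key.resource, key.operation, key.interval_ms / 1000, "s")
```

Here the key decodes to a monitor on activequote:1 that recurs every 10 seconds.
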
> Then there are several operation timeouts, perhaps due to ocfs2
> hanging. Two activequote and activequoteadmin stop operations
> could not be killed even with -9, so they were probably waiting
> for the disk.
>
> Mar  9 06:29:40 OGG-ACTIVEQUOTE-02 openais[5439]: [crm  ] info: pcmk_peer_update: lost: OGG-ACTIVEQUOTE-03 504997642
>
> Do you know why the node vanished? You should try to keep your
> networking healthy.
>
> Thanks,
>
> Dejan
>
> > Thanks
> >
> > Darren Mansell
> >
> > _______________________________________________
> > Pacemaker mailing list
> > Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker