[Pacemaker] Frequent SBD triggered server reboots

andrea cuozzo andrea.cuozzo at sysma.it
Thu May 2 20:49:54 EDT 2013


Thanks Lars,

> SP1? That's no longer supported, and the overlapping support period to
> SP2 has long since expired. You really want to update to SP2+maintenance
> updates.

Unfortunately, the OS and SP version for the Oracle project these clusters
belong to were decided several layers over my head, but I'll make a point of
pushing for the upgrade to SP2 anyway; I might get lucky. In the meantime
I've taken an unskilled look at sbd.c and put a -v in the /etc/sysconfig/sbd
file of a non-production cluster, and I'm enjoying the latency details in the
syslog. While the SAN department investigates their side of the problem, I'll
look into trying a different stonith resource; all servers involved have some
kind of IBM management console. Thanks for your answers to my questions and
for your time, very much appreciated.
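For anyone finding this thread later, the change was just this in
/etc/sysconfig/sbd (the device path below is an example, not our real one):

    # /etc/sysconfig/sbd (excerpt)
    # SBD_DEVICE: the shared partition(s) the sbd daemon watches
    SBD_DEVICE="/dev/disk/by-id/scsi-EXAMPLE-part1"
    # SBD_OPTS: extra options passed to the daemon; -v makes it log
    # I/O latency details to syslog
    SBD_OPTS="-v"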

andrea



------------------------------

Message: 4
Date: Thu, 2 May 2013 23:36:42 +0200
From: Lars Marowsky-Bree <lmb at suse.com>
To: The Pacemaker cluster resource manager
	<pacemaker at oss.clusterlabs.org>
Subject: Re: [Pacemaker] Frequent SBD triggered server reboots
Message-ID: <20130502213642.GC3705 at suse.de>
Content-Type: text/plain; charset=iso-8859-1

On 2013-05-02T16:11:11, andrea cuozzo <andrea.cuozzo at sysma.it> wrote:

> external/sbd as stonith device. Operating system is SLES 11 SP1,
> cluster components come from the SLES SP1 HA package and are these
> versions:

SP1? That's no longer supported, and the overlapping support period to
SP2 has long since expired. You really want to update to SP2+maintenance
updates.

> Each one of the three clusters will work fine for a couple of days,
> then both servers of one of the clusters at the same time will start
> the SBD "WARN: Latency: No liveness for" countdown and restart. It
> happens at different hours, and under different server loads (even at
> night, when servers are close to 0% load). No two clusters have ever
> gone down at the same time. Their syslog is superclean; the only
> warning messages before the reboots are the ones telling the SBD
> liveness countdown. The SAN department can't see anything wrong on
> their side, the SAN is used by many other servers, and no-one seems
> to be experiencing similar problems.

That's really strange.

Newer SBD versions cope much better with IO that gets stuck in the multipath
layer forever - they'll time out, abort, and most of the time recover. You
really want to upgrade.
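Until you can upgrade, it's at least worth checking that the on-disk timeouts
leave headroom for multipath path failover; msgwait should be noticeably
longer than your failover time (roughly twice the watchdog timeout is the
usual rule of thumb). A sketch, with example values and an example device
path:

    # Inspect the timeouts currently stored in the SBD header
    sbd -d /dev/disk/by-id/scsi-EXAMPLE-part1 dump

    # Re-create the header with a 30s watchdog (-1) and 60s msgwait (-4)
    # timeout. Note: create re-initializes the partition, so do this
    # with the cluster stopped. The values here are examples only.
    sbd -d /dev/disk/by-id/scsi-EXAMPLE-part1 -1 30 -4 60 create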

In case just your one SBD partition goes bad, you can also have three of
them, which obviously improves resilience (if they are on different
disks/channels, or connected via iSCSI/FCoE etc).
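Roughly like this, with hypothetical device paths; each device gets its own
header, and the daemon is pointed at all of them:

    # Initialize up to three devices in one go (paths are examples)
    sbd -d /dev/disk/by-id/scsi-SAN-A-part1 \
        -d /dev/disk/by-id/scsi-SAN-B-part1 \
        -d /dev/disk/by-id/iscsi-EXAMPLE-part1 create

    # /etc/sysconfig/sbd: list all devices, separated by semicolons
    SBD_DEVICE="/dev/disk/by-id/scsi-SAN-A-part1;/dev/disk/by-id/scsi-SAN-B-part1;/dev/disk/by-id/iscsi-EXAMPLE-part1"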


> - is the 1 MB size for the SBD partition strictly mandatory? In the SLES
> 11 SP1 HA documentation it's written: "In an environment where all
> nodes have access to shared storage, a small partition (1MB) is
> formatted for the use with SBD",

No, this is just the minimum size that SBD needs. You can make it larger if
you want to.
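For example, once a partition of at least that size has been initialized, you
can sanity-check it like this (example device path again):

    # Show the on-disk header: UUID, number of slots, timeouts
    sbd -d /dev/disk/by-id/scsi-EXAMPLE-part1 dump

    # Show the per-node message slots and any pending messages
    sbd -d /dev/disk/by-id/scsi-EXAMPLE-part1 list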

> In http://www.gossamer-threads.com/lists/linuxha/pacemaker/84951 Mr. Lars
> Marowsky-Bree says: "The new SBD versions will not become stuck on IO
> anymore". Is the SBD version I'm using one that can become stuck on IO?
> I've checked without luck for SLES HA packages newer than the one I'm
> using, but SBD getting stuck on IO really seems like something that
> would apply to my case.

Yes. You really want to update; see the first paragraph. There are no newer
SBD versions for SP1. (If you have LTSS, the story may be different, but in
that case, kindly contact our support directly.)



Regards,
    Lars

--
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
HRB 21284 (AG Nürnberg) "Experience is the name everyone gives to their
mistakes." -- Oscar Wilde





