[Pacemaker] Frequent SBD triggered server reboots

Thu May 2 10:11:11 EDT 2013

Hi,

This is my first time asking for help on a mailing list, so I hope I won't
make any netiquette mistakes. I could really use some help with SBD. Here's
my scenario:

I have three clusters with a similar configuration: two physical servers
with fibre channel shared storage, 4 resources (IP address, ext3
filesystem, Oracle listener, Oracle database) configured in a group, and
external/sbd as the stonith device. The operating system is SLES 11 SP1, and
the cluster components come from the SLES SP1 HA package in these versions:

openais: 1.1.4-5.6.3
pacemaker: 1.1.5-5.9.11.1
resource-agents: 3.9.3-0.4.26.1
cluster-glue: 1.0.8-0.4.4.1
corosync: 1.3.3-0.3.1
csync2: 1.34-0.2.39

Each of the three clusters works fine for a couple of days, then both
servers of one cluster simultaneously start the SBD
"WARN: Latency: No liveness for" countdown and restart. It happens at
different hours and under different server loads (even at night, when the
servers are close to 0% load). No two clusters have ever gone down at the
same time. Their syslog is clean: the only warning messages before the
reboots are the ones reporting the SBD liveness countdown. The SAN department
can't see anything wrong on their side; the SAN is used by many other
servers, and no one else seems to be experiencing similar problems.

Hardware

Cluster 1 and Cluster 2: two IBM blades, QLogic QMI2582 (one card, two
ports), Brocade blade center FC switch, SAN switch, HP P9500 SAN 

Cluster 3: two IBM x3650, QLogic QLE2560 (two cards per server), SAN switch,
HP P9500 SAN

Each cluster has a 50GB LUN on the HP P9500 SAN (the SAN is shared, the
LUNs are different): partition 1 (7.8 MB) for SBD, partition 2 (49.99 GB)
for Oracle on ext3.

What I have done so far:

- added options qla2xxx ql2xmaxqdepth=16 qlport_down_retry=1
ql2xloginretrycount=5 ql2xextended_error_logging=1 to
/etc/modprobe.conf.local (then ran mkinitrd and restarted the servers)

- verified with the SAN department that the Qlogic firmware of my HBAs is
compliant with their needs

- configured multipath.conf as per HP specifications for the OPEN-V type of
SAN

- verified multipathd is working as expected: when shutting down one port at
a time, links stay up on the other port; when shutting down both, the
cluster fails over to the other node

- configured SBD to use the watchdog device (softdog) and the first
partition of the LUN; all relevant tests confirm SBD is working as
expected (list, dump, message test, message exit, and the server reboots
when the SBD process is killed). Here's my /etc/sysconfig/sbd:

server1:~ # cat /etc/sysconfig/sbd
SBD_DEVICE="/dev/mapper/san_part1"
SBD_OPTS="-W"

- doubled the default values for Timeout (watchdog) and Timeout (msgwait),
setting them to 10 and 20 seconds, while Stonith Timeout is 60s

server1:~ # sbd -d /dev/mapper/san_part1 dump
==Dumping header on disk /dev/mapper/san_part1
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 10
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 20
==Header on disk /dev/mapper/san_part1 is dumped
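To compare the header across nodes or clusters, the dump output above could be read into a small dictionary. This is only a minimal sketch: the function name parse_sbd_dump is hypothetical, and it assumes the output keeps the "Name : number" layout shown in this post.

```python
import re

# Sample text copied from the dump shown in the post.
SAMPLE_DUMP = """\
==Dumping header on disk /dev/mapper/san_part1
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 10
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 20
==Header on disk /dev/mapper/san_part1 is dumped
"""


def parse_sbd_dump(text):
    """Return {field name: int value} for every 'Name : number' line.

    Lines that do not match (banner lines starting with '==', blank
    lines) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^:=][^:]*?)\s*:\s*(\d+)\s*$", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


header = parse_sbd_dump(SAMPLE_DUMP)
print(header["Timeout (watchdog)"], header["Timeout (msgwait)"])
```

Running the same parser on the dump from each node makes it easy to confirm that both nodes of a cluster see identical header values.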

I've even tested with 60 and 120 for Timeout (watchdog) and Timeout
(msgwait); when the problem happened again, the servers went all the way
through the 60-second countdown before rebooting.
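The relationship between the three timeouts can be sanity-checked with a few lines of arithmetic. The sketch below assumes a commonly cited rule of thumb (msgwait at least twice the watchdog timeout, and the stonith timeout at least msgwait plus 20%); that rule, the 1.2 margin, and the function name check_sbd_timeouts are assumptions for illustration, not something stated in this post.

```python
def check_sbd_timeouts(watchdog, msgwait, stonith_timeout, margin=1.2):
    """Return a list of problems; an empty list means the values
    satisfy the assumed rule of thumb.

    Assumed rule (not from the post):
      msgwait         >= 2 * watchdog
      stonith_timeout >= margin * msgwait
    """
    problems = []
    if msgwait < 2 * watchdog:
        problems.append(
            f"msgwait {msgwait}s is less than 2 x watchdog {watchdog}s"
        )
    if stonith_timeout < margin * msgwait:
        problems.append(
            f"stonith timeout {stonith_timeout}s is less than "
            f"{margin} x msgwait {msgwait}s"
        )
    return problems


# Values from the post: watchdog 10, msgwait 20, stonith timeout 60.
print(check_sbd_timeouts(10, 20, 60))

# The 60/120 test values; the post does not say whether the stonith
# timeout was raised for that test, so 60 here is purely hypothetical.
print(check_sbd_timeouts(60, 120, 60))
```

Under this assumed rule the 10/20/60 combination passes, while 60/120 would only pass if the stonith timeout had been raised along with the other two.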

Borrowing the idea from
http://www.gossamer-threads.com/lists/linuxha/users/79213 , I'm monitoring
access time on the SBD partition on the three clusters: the average time to
execute the dump command is 30ms, and a couple of times an hour it spikes
over 100ms. There's no slow rise from the average when the problem occurs,
though. Here's what it looked like the last time (the dump command runs
every 2 seconds):

...
real    0m0.031s
real    0m0.031s
real    0m0.030s
real    0m0.030s
real    0m0.030s
real    0m0.030s
real    0m0.031s    <-- last record on the file, no more logging, server
will reboot after the timeout watchdog period
...
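Because the log simply stops at the last record, a probe that gives up after a timeout and writes a timestamp on every line might make it easier to see exactly when access stalled. The following is a rough sketch only: the helper names timed_run, probe_line and run_probes, the 5-second timeout and the 100ms spike threshold are all assumptions, and the demo at the end times a harmless command rather than the real dump.

```python
import subprocess
import sys
import time


def timed_run(cmd, timeout_s):
    """Run cmd once; return elapsed seconds, or None on timeout.

    Caveat: on timeout the child is killed and then waited for. A
    process stuck in uninterruptible I/O may not die, in which case
    this call can still block.
    """
    start = time.monotonic()
    try:
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return None
    return time.monotonic() - start


def probe_line(cmd, timeout_s, spike_s=0.1):
    """Format one timestamped log line for a single probe of cmd."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    elapsed = timed_run(cmd, timeout_s)
    if elapsed is None:
        return f"{stamp} TIMEOUT after {timeout_s}s"
    flag = " SPIKE" if elapsed > spike_s else ""
    return f"{stamp} {elapsed:.3f}s{flag}"


def run_probes(cmd, count, interval_s, timeout_s):
    """Yield `count` probe lines, sleeping interval_s between probes."""
    for i in range(count):
        yield probe_line(cmd, timeout_s)
        if i + 1 < count:
            time.sleep(interval_s)


# Demo with a harmless command; on a cluster node the command would be
# ["sbd", "-d", "/dev/mapper/san_part1", "dump"] with interval_s=2.
for line in run_probes([sys.executable, "-c", "pass"], count=2,
                       interval_s=0, timeout_s=5):
    print(line, flush=True)
```

Writing the output to a local disk rather than the SAN keeps the log itself independent of the storage being tested.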

Right before the last cluster reboot I was monitoring Oracle I/O towards its
datafiles, to verify whether Oracle could still access its partition (on the
same LUN as the SBD one) when the SBD countdown starts, and so identify
whether it's an SBD-only problem or a LUN access problem. There was no sign
of Oracle I/O problems during the countdown: it seems Oracle stopped
interacting with the I/O monitoring software at the very moment the Oracle
servers rebooted (all servers involved use a common time server, but I can't
be 100% sure they were in sync when I checked).

I'm in close contact with the SAN department. The problem might well be the
servers losing access to the LUN because of some fibre channel issue they
still can't see in their SAN logs, but I'd like to be 100% certain the
cluster configuration is good. Here are my SBD-related questions:

- is the 1 MB size for the SBD partition strictly mandatory? The SLES 11 SP1
HA documentation says: "In an environment where all nodes have
access to shared storage, a small partition (1MB) is formated for the use
with SBD", while http://linux-ha.org/wiki/SBD_Fencing suggests no size
for it. At OS setup the SLES partitioner didn't allow us to create
a 1MB partition because it was too small; the smallest size available was
7.8MB. Can this difference in size cause the random problem we're
experiencing?

- I've read at
http://www.gossamer-threads.com/lists/linuxha/pacemaker/84951 that Mr. Lars
Marowsky-Bree says: "The new SBD versions will not become stuck on IO
anymore". Is the SBD version I'm using one that can become stuck on IO?
I've looked without luck for SLES HA packages newer than the ones I'm using,
but SBD being stuck on IO really seems like something that would apply to my
case.

Thanks and best regards.
