[Pacemaker] Frequent SBD triggered server reboots

Fri May 3 10:17:03 UTC 2013

Hi Emmanuel,

>> Look at the parameters fast_io_fail_tmo and dev_loss_tmo; maybe your
>> watchdog timeout is too low and it took more than 10 seconds

From "show config" in multipathd -k, I can see the values are:

fast_io_fail_tmo 5
dev_loss_tmo 10
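
A minimal sketch of pulling these two values out of the configuration dump
non-interactively. The helper name get_tmo is hypothetical, and it assumes
the dump lists each setting as "name value" on its own line. It returns only
the first match, which may not be the section that applies to a given LUN.

```shell
# get_tmo NAME: print the first value found for NAME in multipath
# configuration text read from stdin (hypothetical helper).
get_tmo() {
    awk -v key="$1" '$1 == key { gsub(/"/, "", $2); print $2; exit }'
}

# Possible usage on a cluster node (not run here):
#   multipathd -k"show config" | get_tmo fast_io_fail_tmo
#   multipathd -k"show config" | get_tmo dev_loss_tmo
```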

I'm recreating the SBD partition on one of the clusters with a 20-second
watchdog timeout and a 40-second msgwait, and enabling the following
logging:

- SBD latency (SBD_OPTS="-W -v")
- QLogic HBA (ql2xextended_error_logging=1)
- SCSI operations (echo 9411 > /proc/sys/dev/scsi/logging_level)

This way I should at least see something more on the syslog console if/when
the servers get rebooted by the watchdog. Above all, I want to see whether,
during the 20-second countdown, Oracle (which is also being monitored) is
actively using its partition (on the same LUN as the SBD partition) or is
stuck on I/O access.
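
The relation between these timeouts can be sanity-checked with a small
script. This is only a sketch: the function name check_sbd_timeouts is
hypothetical, the "msgwait at least twice the watchdog timeout" rule follows
the usual SBD guidance, and the "watchdog timeout longer than dev_loss_tmo"
rule is an assumption drawn from the suggestion quoted above, not a
documented requirement.

```shell
# check_sbd_timeouts WATCHDOG MSGWAIT DEV_LOSS_TMO (all in seconds).
# Prints "ok" and returns 0 if both rules hold, otherwise prints the
# violated rule(s) and returns 1.
check_sbd_timeouts() {
    watchdog=$1
    msgwait=$2
    dev_loss=$3
    status=0
    # Usual SBD guidance: msgwait should be at least twice the watchdog timeout.
    if [ "$msgwait" -lt $((2 * watchdog)) ]; then
        echo "msgwait ($msgwait) is less than twice watchdog ($watchdog)"
        status=1
    fi
    # Assumption: the watchdog should outlast multipath path-loss handling.
    if [ "$watchdog" -le "$dev_loss" ]; then
        echo "watchdog ($watchdog) does not exceed dev_loss_tmo ($dev_loss)"
        status=1
    fi
    [ "$status" -eq 0 ] && echo "ok"
    return "$status"
}

# The values from this message would pass both checks:
#   check_sbd_timeouts 20 40 10
# Recreating the device with them might look like this (hypothetical path):
#   sbd -d /dev/mapper/sbd_lun -1 20 -4 40 create
```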

Thanks,

andrea

------------------------------

Message: 6
Date: Fri, 3 May 2013 09:52:19 +0200
From: emmanuel segura <emi2fast at gmail.com>
To: The Pacemaker cluster resource manager
	<pacemaker at oss.clusterlabs.org>
Subject: Re: [Pacemaker] Frequent SBD triggered server reboots

Hello Andrea

When you use multipath -v3, look at the parameters fast_io_fail_tmo and
dev_loss_tmo. Maybe your watchdog timeout is too low and it took more than
10 seconds.

Thanks