[ClusterLabs] SBD msgwait & partner reboot time

Tue Sep 8 11:45:32 EDT 2015

Hi,

I've read about how important the relationship is between the SBD
device parameters (msgwait & watchdog timeout) and Pacemaker's stonith
timeout.  However, I've just run into something I had never considered:
how the time a fenced node takes to come fully back up compares with
msgwait.

Two nodes: sles11a & sles11b.  I fenced sles11a (via Hawk's interface,
which triggers the sbd resource agent) and carefully watched
/var/log/messages on sles11b:

Sept 8 11:27:00 sles11b  sbd: Writing reset to node slot sles11a
Sept 8 11:27:00 sles11b  sbd: Messaging delay: 40

[sles11a is rebooting and it comes up in about 12 seconds]

[a bunch of messages about sles11a joining the cluster]

[finally node sles11a is online at about 11:27:25]

Sept 8 11:27:40 sles11b sbd: Message successfully delivered

[sles11a is put offline!]

Sept 8 11:27:41 pengine[4358]: warning: custom_action: Action p_stonith-sdb_monitor_0 on sles11a is unrunnable (pending)

I've done it about 5 times and it happens every time.

My values are 20 seconds (watchdog timeout) and 40 seconds (msgwait).
I know, I know, that's too much for my lab environment, but I'm just
curious whether something is wrong or whether msgwait really does
ALWAYS NEED to be less than the reboot time.
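
To make the timing concrete, here is a rough Python sketch of the
timeline above.  It only uses the numbers from my log excerpt (the
11:27:25 rejoin time is approximate), and the variable names are just
mine for illustration, not anything from sbd or Pacemaker:

```python
from datetime import datetime, timedelta

# Timestamps and values taken from the log excerpt above.
fence_written = datetime(2015, 9, 8, 11, 27, 0)   # "Writing reset to node slot sles11a"
msgwait = timedelta(seconds=40)                    # "Messaging delay: 40"
rejoin_online = datetime(2015, 9, 8, 11, 27, 25)   # sles11a back online (approximate)

# sbd reports the message as delivered only once msgwait has elapsed.
delivered = fence_written + msgwait

# How long the node had already been back online at that point.
overlap = delivered - rejoin_online

print("message reported delivered at:", delivered.time())
print("node already online for (s):", int(overlap.total_seconds()))
print("node rejoined before msgwait expired:", rejoin_online < delivered)
```

So in my case the node had been back in the cluster for roughly 15
seconds by the time the fence message was reported as delivered.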

Thanks,
Jorge